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Résumé 


Les réactions biochimiques sous-jacentes au fonctionnement des cellules sont des processus in- 
trinséquement stochastiques. En conséquence, le fonctionnement de la cellule, considérée comme 
un systéme, est aléatoire en raison des fluctuations de ses composantes fondamentales. Parmi 
ces dernières se trouvent les protéines, qui jouent un rôle majeur dans les cellules. Le caractère 
stochastique des protéines est tel qu’il est même responsable des différences observées dans le 
phénotype et ce même dans le cas de cellules clonées exposées à des conditions environnementales 
identiques. La grande difficulté rencontrée dans le développement de techniques quantitatives 
fiables pour la mesure des fluctuations de l’expression génétique au niveau cellulaire à favorisé 
le développement et l’utilisation de modèles stochastiques essayant de capturer les principales 
caractéristiques du système. Il est donc crucial que les modèles adoptent des hypothèses réalistes, 
afin de pouvoir les utiliser comme un véritable outil d'investigation. 

Dans ce travail de thèse nous avons mis en place un nouveau cadre mathématique basé 
sur les Processus Ponctuels de Poisson Marqués (MPPP) pour décrire les principales étapes de 
la production d’une protéine spécifique, grâce à une analogie entre le système de production de 
protéines et les réseaux de files d’attente. Cette approche s’est avérée être très adaptée à la tâche, 
car elle permet de considérer des hypothèses générales pour certaines étapes, tout en gardant le 
caractère analytique des modèles présents dans la littérature, qui se réduisent à un cas particulier 
de cette approche générale pour des hypothèses spécifiques. Avec ce cadre, nous avons réussi 
à surmonter l’hypothèse fondamentale et restrictive des modèles classiques, qui exige une durée 
exponentielle de toutes les étapes. La description non-markovienne de l’expression génétique 
obtenue grâce à ce nouveau cadre a permis d’aborder le problème d’une manière plus satisfaisante 
et, en particulier, de proposer un modèle plus réaliste qui comprend des hypothèses réalistes de 
l’étape d’élongation de la protéine et de la dilution des protéines en raison de la croissance du 
volume. Eu égard aux résultats obtenus, cette nouvelle approche a montré que les modèles 
classiques ont su capturer qualitativement les caractéristiques du bruit dans l’expression d’un 
gène, mais leur pouvoir de prédiction est limité par le fait que les formules quantitatives obtenues 
sont incorrectes. L'utilisation des MPPP a permis de d'évaluer l'impact de différents choix de 
modélisation, tout en gardant la capacité d’obtenir des formules analytiques pour les premiers 
moments des distributions des différents processus en fonction des paramètres biophysiques. 

La modélisation du processus de production d’une seule protéine, bien que puissante, ne prend 
pas en compte la description des interactions qui peuvent se produire en raison de la production 
simultanée des différents types de protéines. Pour cette raison, nous avons proposé une première 
modélisation de la production de plusieurs protéines en considérant les interactions comme le 
résultat de la compétition pour des ressources communes. Plus précisément, les ribosomes sont 
responsables de la traduction de tous les types de messagers et leur nombre est limité et stricte- 
ment contrôlé dans la cellule. En pratique, le coût élevé, en termes de ressources, associé à la 
production des ribosomes oblige la cellule à optimiser leur usage et il s'avère qu’ils sont presque 
toujours en train de traduire des messagers et restent très peu inactifs après l’achèvement de 
leur tâche. En conséquence, il y a une rude compétition entre les messagers pour avoir accès 
à des ribosomes libres et la production globale en est affectée. Le système de production est 
étudié par une approche de champ moyen à la fois dans les régimes de sous-charge et de sur- 
charge d'utilisation des ribosomes. Le modèle multi-protéines est une approche novatrice dans le 
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domaine qui ouvre une nouvelle direction dans l’étude des fluctuations des protéines au niveau 
cellulaire. 

En conclusion, la thèse a porté sur l’étude de la nature stochastique de l’expression génétique, 
en développant différents modèles afin de progresser vers une description plus réaliste des phéno- 
ménes. Toutes ces études ont été menées en essayant de mettre la biologie au premier plan, car 
nous croyons que ces modèles représentent un outil fondamental dans l’étude et la compréhension 
des processus biologiques complexes. 
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Abstract 


Biochemical reactions underlying the functioning of cells are inherently stochastic processes. 
As a consequence, the whole system is noisy and undergoes fluctuations in its fundamental com- 
ponents. Proteins are major players for the living. The behavior of cells as well as their stochastic 
character manifests itself via striking differences in phenotype. The differences are apparent even 
in the case of identical, cloned cells which underwent the same environmental conditions. The 
extreme difficulty in obtaining reliable experimental quantitative results concerning fluctuations 
in gene expression has fostered the development and intense use of stochastic models. These 
models aim at capturing the main characteristics of the system. Given the context, it is crucial 
that models embrace realistic assumptions. 

We introduced a new mathematical framework based on Marked Poisson Point Processes 
(MPPP) to describe the main steps of the production of a specific protein. We leveraged the 
similarities between the protein production system and queueing networks. This approach has 
proven to be perfectly suited, since it allowed us to consider general assumptions, while still 
permitting the derivation of analytical formulas. This is one of the key features of the models 
found in literature. Furthermore, the classic models are incorporated in this approach and well- 
known results can be obtained for specific assumptions. This has been possible, since we were able 
to overcome the restrictive assumption, crucial in classic framework, of exponentially distributed 
duration of all steps. The non-Markovian description of gene expression obtained through this 
new framework has permitted to tackle the problem in a more satisfying way and, in particular, 
propose a more realistic model of gene expression, which includes realistic assumptions of protein 
elongation step and protein dilution due to volume growth. Such a realistic model shows that if 
on one hand the classic models have captured the qualitative behavior of the underlying biological 
processes, on the other hand their quantitative results might have been inaccurate, resulting in 
a limited predictive power. The MPPP framework has proved to serve as a testing platform, 
allowing to quantify precisely the impact of different modeling choices, while keeping intact the 
ability to obtain analytical formulas of statistics depending on the biophysical parameters. 

The single-protein modelisation, although powerful, fails to describe the possible interactions 
deriving from the simultaneous production of different types of proteins. For this reason, we 
moved the first steps towards a modelisation of the production of many proteins, considering 
interactions as the result of the competition for common resources. In particular, ribosomes 
translate any type of messengers and turn to be present in limited and strictly controlled number 
within a cell. In fact, the high cost in term of resources associated with the production of 
ribosomes forces the system to have almost always active ribosomes translating messengers into 
proteins. As a consequence, messengers compete against each other for the rare resource of 
free ribosomes and the global production is affected. The system of production is studied via 
a mean-field approach both in the underloaded and overloaded regimes of use of the ribosomes. 
The multi-protein model brings a completely new approach in the domain and marks a new 
direction in the investigation of protein fluctuations at the cellular level. 

In conclusion, the thesis has focused on the study of the stochastic nature of gene expression, 
by developing different models in order to progress towards a more realistic description of the 
phenomena. All these studies have been conducted trying to put biology in the foreground, since 
we believe these models represent a fundamental step in the investigation and understanding of 
complex biological processes and are a complementary tool to biological experiments. 
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Chapter 1 


Introduction 


Aussi la biologie est-elle, pour l’homme, 
la plus signifiante de toutes les sciences; 
celle qui a déja contribué, plus que toute 
autre sans doute, à la formation de la 
pensée moderne, profondément 
bouleversée et définitivement marquée 
dans tous les domaines |...] 


Le hasard et la nécessité 
JACQUES Monop, 1970 


Nicolaus Copernicus proposed, in his De revolutionibus orbium coelestium, an heliocentric 
model for the celestial objects. This theory led to a deep wound in the narcissistic simulacrum 
that men built around the Ptolemaic model by putting away the man from the center of the 
Universe. 

The chance is then found to be responsible of the mutations and so of the evolution itself. If 
the Theory of Evolution was well established and phenomenologically accepted since the end of 
the XIX century, it required a physical theory of heredity in order to have a deeper acceptance 
and solid foundations. This is the aim of the molecular theory of the genetic code, which tries 
to connect concepts related to the chemical structure of the genetic material, the information 
it contains and the molecular mechanisms for the morphogenetic and physiologic expression of 
this information. The molecular biology has further pushed the man out of the center of the 
Universe, redefining life in terms of molecular interactions. 

In 1953, Watson and Crick proposed their helicoidal model of DNA structure, which, together 
with subsequent works, has given insights on how genetic information is stocked in cells and how 
it is possible to pass this information from a generation to the next. If DNA contains the 
morphogenetic and functional information of an individual, the fulfillment of the cell project is 
accomplished through the gene expression. The products of gene expression mainly consist of 
proteins and functional RNAs in the case of non-protein coding genes such as ribosomal RNA 
(rRNA) or transfer RNA (tRNA). The protein production is the central topic of this PhD work 
and we will focus on its description via a mathematical modeling. 

Since the early works of the 50’s, major advancements have taken place in the understanding of 
gene expression leading to an extremely detailed biochemical model of the underlying processes. 
However, if the description of the system is impressively highly detailed, there are still open 
questions on the fundamental mechanisms, urging a systemic analytic description in order to 
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identify the laws underlying gene expression. The encounter of different components is the 
fundamental step common to all elementary biochemical reactions. This, together with the fact 
that cytoplasm is a stiff disordered medium where different compounds move by diffusion, makes 
the whole process inherently stochastic. Although the first indirect experimental evidences can 
be traced back to the end of the 50’s [50], stochasticity has been clearly proven experimentally 
only in recent years [40]. Noise has become a central topic in molecular biology, bearing new 
challenges in order to reconcile the random biochemical reactions to the precise development of 
organisms. It is clear that cells and in particular gene expression are robust with respect to 
fluctuations both in the environment and in its fundamental components. If noise is firstly an 
obstacle to the accomplishment of the genetic program, in some specific situations fluctuations 
can be exploited, as in the case of the decision making process. However, despite observable 
positive or negative consequences of noise, those mechanisms represent a small portion of the 
whole picture and a characterization of the status of noise is still an open question. 


Q: Before this knowledge how can we characterize the underlying stochastic process via a math- 
ematical analysis? 


The objective of mathematical biology is to extract information on the system and on its 
specific mechanisms, via a mathematical description and characterization. Nevertheless, there is 
a number of difficulties connected with gene expression due to the colossal number of elementary 
steps and the mixing of deterministic and stochastic processes. Moreover, a complex interaction 
pathway links different processes both at local and global scale making the analysis more and 
more difficult. Hence the need to combine elementary steps into effective steps and focus on 
a part of the system, once the level of detail has been decided and pertinent questions have 
been formulated, with the necessity to be able to perform the analysis in some mathematical 
framework. 

If on one side experimental evidence and measures on the fluctuations in protein production 
are recent, nevertheless the problem has been tackled since the end of 70’s and a quite large 
literature has followed in subsequent years. These models of stochastic gene expression have 
addressed the problem considering effective main steps in the production of a specific protein 
under reasonable assumptions. These assumptions and this modeling approach have been pro- 
posed when the process was not yet well studied and understood. A Markovian description of 
the system has been then adopted, allowing a simplification of the analysis and leading to derive 
analytic formulas of the mean and variance of the different processes modeled, but such descrip- 
tion comes with a strong assumption, i.e. each step has an exponentially distributed duration. 
If the obtained theoretical results are consistent with experiments with corresponding effective 
biophysical parameters, however these models are “semi-quantitative”. In fact,if on one side it 
is possible to find appropriate parameters to fit experimental data, however these models have 
not been used in a predictive way, i.e. choose parameters beforehand and compare the outcomes 
with the experiments. 

If the Markovian description was adopted to simplify the mathematical analysis of the model, 
the strong exponential assumption that comes with it has not been discussed in the light of the 
findings of the last decades. However, in order to investigate the pertinence of the Markovian 
description and of the strong assumption which comes with, we are forced to leave such classic 
framework for a more general one. The aim of the thesis work is on one side to revisit the 
reference model of stochastic gene expression [54] and discuss the assumptions with respect to the 
knowledge acquired in the last years and, on the other side, to consider the coupling of different 
proteins. The first objective lead us to introduce a new mathematical framework based on 
Marked Poisson Point Processes (MPPP), in order to overcome the theoretical limitations of the 
classic approach, providing a general context to test assumptions and mix processes of different 


nature. The second objective is instead connected to the observation that all types of proteins 
share common cellular machinery, such as ribosomes, present in limited numbers. The limited 
availability of these resources lead to a rude competition between different protein production 
chains. In particular, we analyze rigorously the impact of the competition of messengers for 
ribosomes on the resulting fluctuations in protein copies and on the number of free ribosomes. 


Presentation of subsequent chapters 


Chapter 1: Introduction. In this chapter we give few elements of the complex biology that 
underlies gene expression. In this perspective, this introductory biology-oriented chapter should 
allow the non-biologist reader to better follow the next chapters. However, this chapter is not 
intended to be exhaustive from the biological point of view, since the presentation of the biology 
is functional to the following chapters. Important experimental results concerning stochasticity 
in gene expression are then presented as well as a short description of few of the fundamental 
models in the area. Few key concepts of stochastic gene expression are then recalled. 


Chapter 2: MPPP description of gene expression. In this chapter we introduce the 
MPPP mathematical framework and derive a non-Markovian description of the gene expression 
of a single protein based on a three-stage model. Analytic formulas of mean and variance of the 
various modeled processes at equilibrium are derived under general assumptions. Tests in this 
first approach of possible choices of general steps point towards a counter-intuitive result in terms 
of protein fluctuations: if the exponential duration of one step is replaced with a deterministic 
one, the corresponding fluctuations in protein number result increased. 


Chapter 3: Realistic model of gene expression. The framework introduced in Chapter 
is used to build a realistic model of gene expression, the four-stage model, including the 
supplementary step of the protein elongation step and the description of protein dilution due to 
volume growth. It is then discussed the choice of realistic biological assumptions and the model 
is therefore specified and analyzed. In particular, the counter-intuitive result of Chapter |2] is 
recovered when analyzing the impact of protein elongation on the overall fluctuations. Moreover, 
the formulas derived with an accurate description of protein dilution prove that the formulas in 
literature are oversimplified and rely possibly on too strong assumptions. 


Chapter 4: Multi-protein model. In this chapter the problem of the interactions deriving 
from the simultaneous production of many proteins types. Unlike the previous chapters, where 
the amount of free ribosomes is constant, we suppose ribosomes are present in limited number 
and consider their switching between actively elongating proteins and freely diffusing in the 
cytoplasm. Messengers compete against each other for the (rare) resource of free ribosomes and 
the protein production is globally affected. The system of production is studied, by considering 
a large number of protein types with specific concentrations, via a mean-field approach both 
in the underloaded and overloaded regimes of use of the ribosomes. In particular, under the 
realistic biological assumption of overloaded ribosomes, we find that at equilibrium the number 
of free ribosomes follows a Poisson distribution and the rate of production of each protein type 
is obtained. 


Appendix. Appendix[Âfrecall few fundamental definitions and theorems used in the manuscript. 
Appendix [B] is devoted to recalls of biology. In particular, Section gives a more-in-depth 
description of few fundamental biological mechanisms, while Section [B.2] serves as a biological 
glossary, describing few specific cellular components. 


4 CHAPTER 1. INTRODUCTION 


1.1 Gene expression: main mechanisms 


The rest of the chapter should introduce the non-biologist reader to the topic of gene expression 
and its modeling, through a tour of the main biological mechanisms involved and the state- 
of-the-art of experiments and models. However, this is not intended to be exhaustive and the 
biological mechanisms described are oversimplified, since they are introduced in the perspective 
of the mathematical modeling of the next chapters. 

In 1953 James D. Watson and Francis Crick published in the journal Nature an article [72] 
in which they expose their model of the structure of the DNA, which featured the anti-parallel 
double helix held together by hydrogen bonds between pairing nucleotides. In those turbulent 
years, theoretical biologists proposed models using the partial information at their disposal, 
years before the first experimental results of molecular genetics were available. All the models 
and mechanisms proposed were based on the information that could be extracted by indirect 
observations and this was also the case for Watson and Crick, who used X-ray diffraction images 
to propose their model. 

The Watson-Crick model provided also key insights to explain how genetic information is 
transferred from a generation to the next and how this information may be spread within the 
cell. The authors, in the Nature article [72], write 


It has not escaped our notice that the specific pairing we have postulated immediately 
suggests a possible copying mechanism for the genetic material. 


1.1.1 Central Dogma: fifty years of molecular biology 


The central dogma of molecular biology, that was first stated by Crick in 1958 [II], was then 
re-stated by the author in 1970 as follows: 


The central dogma of molecular biology deals with the detailed residue-by-residue 
transfer of sequential information. It states that such information cannot be trans- 
ferred from protein to either protein or nucleic acid. 


The two main concepts that were produced in the late 50s were those of sequential infor- 
mation and of defined alphabets. At the time was already known that proteins have a specific 
three-dimensional configuration, which affects the activity of the protein itself. The researchers 
decoupled the problem supposing that the amino acid chain was able to fold itself up, reducing 
the problem to a one dimensional one and allowing to focus just on the assembly of the polypep- 
tide chain. It was well-established at that time that the alphabet of the proteins is composed by 
twenty amino acids, but it was unknown the mechanisms that lead to their encoding. 

At that time it was well known that DNA, RNA and proteins play a leading role in gene 
expression and the central dogma is a possible solution to the problem consisting in the formu- 
lation of general rules for the information transfer from a polymer with a defined alphabet to 
another one. 

Crick represents the flow of detailed sequence information from one chain to the other using 
arrows, in a schema as in figure where all possible transfers are plotted. The transfers 
could be divided in three group following the general opinion in the late fifties: those for which 
that seemed to exist because of direct or indirect evidence, those which have no experimental 
evidence nor a strong theoretical need and those which were unlikely to exist. Crick carries 
out a progressive simplification of this scheme excluding first the processes in the last class and 
validating those more likely to happen. 
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information in gene expression. in 1970 [I2]. 


Figure 1.1: All possible flows are showed in figure |(a)| Figure shows the picture resulting 
in Crick’s paper [12]. The Solid arrows are “general transfers” (first class), dotted arrows are 
“special transfers” (second class) and the absent arrows are the undetected transfers. 


Using the classification made by Crick in 1970 [12], we can draw the schema shown in figure 
Here the solid arrows represent the “general transfers” (first class), while the dotted arrows 
are the “special transfers” (second class). The absent arrows are the undetected transfers. 

The central dogma has to be read as a negative statement saying that there are no information 
transfers from protein, stressing out which are the most likely transfers (solid lines) and which are 
the probable ones (dotted lines). Nevertheless the central dogma does not say anything about the 
machinery involved and the control mechanisms. It was an attempt to give theoretical insights 
on the main principles which lead to the expression of a gene, using the partial information 
available at the time and represents the very foundations of molecular biology. 

Experiments have confirmed the correctness of the main principles stated by Crick and new 
technologies have considerably increased the knowledge on the subject and have given a detailed 
description of the underlying biochemical reactions. This descriptive approach seems to have no 
end: finer mechanisms pop up when more accurate techniques are available and take their place 
in the already complex scenario of gene expression. 

Despite extensive researches in the field and the many knowledge acquired, little is known 
about fundamental mechanisms and strategies underlying protein production, because of the 
extreme complexity of the whole process and the stochastic nature of the elementary biochemical 
reactions. For all these reasons, mathematical models represent a tool of investigation, in order 
to isolate mechanisms and check hypothesis based on the acquired knowledge. 


1.1.2 Gene expression: main biological mechanisms 


The present section is devoted to a general short introduction of the main steps of gene expression 
and of the main biological mechanisms which intervene in such complex process. This is not 
intended to be exhaustive, but to introduce the basic terminology which will be used in the 
following chapters. Specific biological mechanisms will be introduced through the manuscript 
when needed. 

Despite the Central Dogma gives the fundamental principles of information transfer in gene 
expression in any cell type, the description of the process via mathematical modeling should 
take into account the specificity of the cell types. In particular, models need to distinguish 
between prokaryotic and eukaryotic cells, at least for specific mechanisms and for their different 
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geometric organization. This PhD work focuses on prokaryotes and the subsequent modelisation 
is therefore affected. However, we will make clear when a modeling choice is strictly connected 
with prokaryotes; all other choices must be understood as common to both cell types. 


Gene activation 


Gene activation is the process which allows a gene to be expressed at a specific time. The way 
this activation may occur varies a lot from gene to gene and from organism to organism. The 
main mechanisms causing gene activation are the dissociation of a repressor and the association 
of an activator. 

A repressor is a DNA-binding protein that regulates the expression of a specific gene by 
binding the operator, which is a segment of DNA that a regulator binds to. The binding of the 
repressor blocks the attachment of RNA polymerase to the promoter and prevents the transcrip- 
tion of the genes. If an inducer molecule is present, it can interact with the repressor and inhibit 
its action by detaching it or preventing its binding to the operator. 

An activator is a DNA-binding protein that regulates one or more genes by increasing the 
transcription rate. RNA polymerase binds to the promoter region of the gene, forming a complex 
which sometimes proceed to gene transcription. An activator recruits the RNA polymerase to 
its promoter region. 

If the two previous mechanisms are shared between 
prokaryotic and eukaryotic cells, chromatin remodeling 
is specific to eukaryotes. Chromatin is the complex of 
DNA and histone proteins with which it associates. Hi- 
stones are highly alkaline proteins found in eukaryotic 
cell nuclei that package and order the DNA into struc- 
tural units called nucleosomes. Chromatin on one side 


serves as a way to condense DNA within the cellular TTT ALL LA LH i ALL A L 
nucleus and, on the other side, as a control of gene ex- 


pression. Raser and O’Shea [62] hypothesize that chro- 
matin remodeling is the key regulation mechanism for 
certain eukaryotic promoters. 
Gene activation is a complex process resulting from 
different mechanisms and it is gene and organism spe- 
cific. Despite genes may show different states, in first 
approximation it can be described as a two-states pro- 
cess, i.e. the gene may show only two possible states, 
active or inactive. Inactive gene 
The number of copies of a gene within bacteria is a 
fundamental factor and should be considered in a model 
describing the expression of a specific gene. When bac- Figure 1.2: Gene activation. When the 
teria are growing they duplicate their DNA, that leads repressor (red cartoon) binds to the 
to a number of at least two copies per gene, since the gene, it inhibits the mRNA transcrip- 
genetic information has to be split between daughter tion while the gene is activated when 
cells. the repressor unbinds. 


Active gene 


Remark. Bacteria are often obliged to have more than 

two copies of DNA, since the duration of replication (~ 40 minutes) is sometimes longer than 
cell cycle time, which is ~ 20 minutes in Escherichia coli in fast growth conditions. For this 
reason, in “normal” growth conditions we observe DNA regions with one, two or four copies of 
genes, while in “regeneration” regime, where the cell division cycle takes about 20 minutes, we 
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have up to eight copies of genes localized closer to the origin of replication. 


Transcription 


The transcription process can be described through the following fundamental steps: 


1. initiation: the polymerase binds to one of the specificity factors ø to form a “holoenzyme” 
in order to attach to a specific promoter in the DNA. The more similar is a sequence to a 
“consensus sequence” the stronger is the binding to the DNA. After the first bond has been 
synthesized, the RNA polymerase must clear the promoter (this phase is called promoter 
clearance). During this time it may occur that a truncated transcript, called abortive 
initiation, is released; 


2. elongation: after the promoter clearance, the polymerase assembles in a controlled fashion 
the mRNA chain; 


3. termination: the p-independent transcription termination or the p-dependent transcrip- 
tion termination. The first involves terminator sequences within the RNA that signals the 
RNA polymerase to stop. The latter uses the p terminator factor to stop RNA synthesis. 


For further details we refer to Appendix 


-9 “=e 
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Figure 1.3: Transcription. The polymerase binds on the active gene and transcription initiation 
takes place. Once the initiation step is completed, the polymerase starts copying one DNA strand 
and elongates the mRNA, which is eventually released into the cytoplasm. 


The transcription regulation controls the frequency and the number of produced messengers. 
The gene transcription is subject to many control mechanisms and we just recall the most 
common. The specificity factors alter the specificity of RNA polymerase for a given promoter or 
set of promoters, making it more or less likely to bind to them, i.e. sigma factors in prokaryotic 
transcription. Other regulations are made at gene level and have been enumerated in the previous 
section. In post-transcriptional phase the regulatory machine controls the number of mRNAs 
that are translated into proteins. The stability and distribution of the different transcripts is 
regulated (post-transcriptional regulation) by means of RNA binding protein (RBP) that controls 
the various steps and rates of the transcripts. 

Prokaryotic and eukaryotic transcription shows peculiar characteristics. In fact, since there 
is no precise spatial organization in prokaryotes, translation step can start when the polymerase 
is still building the messenger. This is not possible in eukaryotes since transcription occurs in 
the nucleus and, therefore, the messenger needs first to be exported out of the nucleus in order 
that the translation can take place. 
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1.1.3 Translation 


Schematically prokaryotic translation consists of the following steps (see Figure[1.4] for schematic 
representation): 


1. initiation: which involves the assemblage of components such as ribosomal subunits (50S 
and 30S), mRNA, the first aminoacyl tRNA, GTP (energy) and initiation factors (IF1, 
IF2, IF3). The tRNA (transfer RNA) serves as the physical link between the nucleotide 
sequence of mRNA and the amino acid sequence of proteins. In particular, the aminoacyl 
tRNA (or charged tRNA) carries an amino acid to the ribosome as directed by the three- 
nucleotide sequence (codon) read by the ribosome. The ribosome has three sites: A, P and 
E sites. The A site is the entry-point for aminoacyl tRNA, except for the first that binds 
directly on the P site. In the P site the peptidyl tRNA is formed, i.e. a tRNA bound to 
the peptide being synthesized, and in the E site the uncharged tRNA detaches from the 
ribosome; 


2. elongation: it is a controlled process in which the polypeptide chain is elongated with 
the addition of amino acids to the carboxyl end of the growing end. Elongation involves 
several elongation factors, a conformal change, bond formations, etc. The aminoacyl tRNA 
attaches in the A site, then moves to the P site where the polypeptide is attached to the 
growing chain and the uncharged tRNA is moved to the E site where exits from the complex; 


3. termination: occurs when one of three terminating codons moves to the A site. These 
codons are not recognized by any tRNA but by the so called release factors. These factors 
trigger hydrolysis of the ester bond and release the newly produced protein in the cytoplasm. 
The ribosome recycling step is responsible of ribosome disassembly in such a way to be 
ready to start translation of other messengers. 


Translation is carried out by more than one ribosome simultaneously. Because of relative large 
size of ribosomes, they can only attach sites on mRNA at least 35 nucleotides apart. The so 
called polysome is the complex of one mRNA and a number of ribosomes attached to it. 


A a Å 
W P x 
Initiation 


Figure 1.4: Translation. The ribosome binds on the messenger and translation initiation takes 
place, involving a series of controls in order to assure the right progress of translation. Once 
the initiation step has completed, the ribosome starts the elongation of the protein chain and 
eventually releases it into the cytoplasm. 


The translation of mRNA can also be controlled by a number of mechanisms, mostly at the 
level of initiation. Recruitment of the small ribosomal subunit can be modulated by mRNA 
secondary structure, anti-sense RNA binding or protein binding. In both prokaryotes and eu- 
karyotes there is a large number of RNA binding proteins, which are often directed to their target 
sequence by the secondary structure of the transcript. This structure may change depending on 
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certain conditions, such as temperature or the presence of a ligand. Moreover, some transcripts 
act as ribozymes and self-regulate their expression. 


mRNA degradation 


The process of messenger degradation is an essential function for recycling nucleotides and for 
regulating the level of gene expression and is performed by RNase. The decay process occurs on 
short time scales, i.e. the typical half-life of a messenger is of about two minutes at 37°C in most 
cases. This rapid decay process serves to permit to continuously adjust the number of specific 
messengers to the population needs depending on the specific environmental conditions. 

The decay process consists of two main steps [T4]: 


1. initiation: primarily due to endonucleolytic attack mediated by the RNase E enzyme in 


E. Coli [41]. 


2. break-down: following the initial endonucleolytic cleavage, which is thought to inactivate 
the message for translation. Additional cleavages take place and result in breakdown of the 
mRNA into fragments. 


Experiments have shown that prokaryotic mRNAs are more unstable and have shorter lives 
than in eukaryotes. This is probably connected to the absence of a physical separation be- 
tween the sites of RNA synthesis and RNA function; decay is possibly the major form of post- 
transcriptional control in these organisms. The stability of the mRNA is connected also to the 
competition of RNases and ribosomes to bind to messengers [58], i.e. genes with a weak affin- 
ity of ribosomes and mRNAs show higher levels of mRNA degradation, since ribosome binding 
protects the messenger from decay [3] [9] [74]. 


Protein decay: proteolysis and volume dilution 


Two main mechanisms of profoundly different nature are responsible of the decay of proteins: 
proteolysis and volume dilution. 

The first mechanism is analogous of the degradation mechanism of messengers, but it is now 
mainly used to destroy possibly dangerous proteins, such as misfolded proteins, since protein’s 
structure determines not only its specific cellular function, but also its intracellular stability. The 
degradation machinery differs between eukaryotes and prokaryotes, as shown in the review article 
of Goldberg [21]. Prokaryotes, in particular, have developed an elaborate proteolytic machinery 
to quickly destroy misfolded proteins. If protease is the enzyme that conducts proteolysis, nev- 
ertheless, the machinery is much more complex, since if proteases were free to act in the cytosol, 
“they would quickly convert the cell into a bag of amino-acids” [21]. In any case, the proteolysis 
appears as a control mechanism to prevent the release/survival of malfunctioning proteins or to 
remove damaged proteins. Actually, proteins are continuously subjected to stress, such as tem- 
perature, that eventually causes the protein denaturation. The denatured protein needs to be 
removed since its functioning has been compromised. This aging phenomenon of proteins occurs 
on long timescales: the average protein lifetime is usually bigger than the protein cell cycle. 

The second mechanism, protein dilution, is of completely different nature. Both prokaryotic 
and eukaryotic cells double their internal components in order to give rise to two daughter 
cells. The volume growth associated to the doubling affects the concentration of each cellular 
component because of volume dilution. Intuitively, if we stop the production of proteins at some 
point, their concentration will drop down as a consequence of the increase of cell volume. This 
mechanism is therefore very different from the biochemical interactions which lead to proteolysis. 
Dilution is strictly connected with the cell growth rate and it turns out to be continuous and 
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deterministic, since growth rate is fixed, as long as environmental conditions are kept unchanged. 
In normal conditions and for stable proteins, dilution is the leading degradation mechanism. 


1.2 Stochasticity in gene expression: experiments 


Randomness and determinism are constantly present in the development, growth and life of cells: 
random biochemical reactions have to be reconciled to the precise development of organisms. The 
biological implications of the stochastic fluctuations in gene expression has boosted researches in 
the field, that have multiplied both theoretical and experimental works. 

If the stochastic fluctuations were often taken apart by considering statistics on large numbers 
and reducing the analysis to deterministic models, experimental scientists have become more and 
more aware of the inherent stochastic nature of the gene expression. Researchers have found that 
variability among cells in a genetically identical population is strongly connected with fluctuations 
in the expression of single genes. Stochasticity in the protein production is often just considered 
as a danger for the normal development of organisms. Nevertheless, some living organisms may 
exploit the stochastic fluctuations in the expression of genes to introduce phenotypic diversity 
in genetically identical cells. This variability can be advantageous in specific cases, like face to 
drastic variations in environment or stress conditions, but it can also be very dangerous when it 
turns out to be an obstacle to the realization of the cell program. 

We focus on experimental results concerning stochasticity in the gene expression, from experi- 
mental evidences of the stochastic nature of the phenomenon to negative or positive consequences 
of fluctuations. Few models are considered here and we refer to Sections[L.3]and2.Al]for a detailed 
description. 

In the late 50s Novick and Weiner [50] showed that the production beta-galactosidase (B- 
galactosidase or B-gal) was variable and random in individual cells, but those studies were 
hindered by the lack of reliable measures and were not considered conclusive to prove stochasticity 
in gene expression. One of the first studies which use an expression reporter in single cells was the 
work of Ko et al.[40] in early 90s. In this work researchers have examined the effect of different 
doses of glucocorticoid on the expression of the transgene encoding B-gal and have found a large 
cell-to-cell variability by directly measuring the amount of protein in different cells. Moreover, 
as in Novick’s work, increasing the dose does not increase uniformly the expression in every cell, 
but it increases the frequency of cells displaying high level of expression. The dose dependence 
has been interpreted by authors as a change in the probability that an individual cell would 
express the gene at high level, concluding that the gene expression is a stochastic process. 

In 2002, Ozbudak et al.[51] studied the fluctuations of gene expressing green fluorescent 
protein (GFP) driven by an inducible promoter in Bacillus Subtilis. The authors tune the rate of 
transcription by varying the level of induction of the promoter. Translation rate was modulated 
by introducing mutations in the ribosome binding site (RBS). It results that the transcription 
and translation rates affect the protein fluctuations and the results were interpreted using the 
theoretical model proposed by Thattai and van Oudernaarden [71]. This model predicts that the 
protein relative variance depends on the transcription rate, but it remains unchanged because of 
variations in the rate of translation. 

Elowitz et al.[16] introduced the dual-reporter technique to measure the stochastic fluctuations 
of proteins in Escherichia Coli. This technique allows to express two different fluorescent proteins, 
the CFP and YFP, from identical promoters. Since the two proteins share the same regulatory 
control, the differences between their expression can be attributed to the “intrinsic" stochasticity 
of the gene expression process, because of the random microscopic events which govern each 
reaction. On the other side, the “extrinsic" noise, which derives from cellular heterogeneity, 
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such as regulatory proteins, ribosomes and polymerases, or stochastic events in upstream signal 
transduction, will affect both proteins. We refer to Section [1.3] for a deeper analysis on these 
concepts. Authors showed that extrinsic noise represents a non-negligible portion of the overall 
fluctuations and stressed the necessity to take into account both sources of noise when controlling 
or minimizing the fluctuations of a system. 

Jonathan M. Raser and Erin K. O’Shea used the same technique to study gene expression 
in yeast. The authors analysed three different promoters in the budding yeast Saccharomyces 
cervisiae: the PHOS, PHO84 and GAL1 promoters. The total noise on the three promoters was 
found to be dominated by the contribution of the external factors, such as cell shape and size, 
cell cycle stage or gene-specific signaling. The authors reduced these possible factors of hetero- 
geneity by using experimental techniques, like flow cytometry, used to isolate sub-populations 
with homogeneous sizes. The extrinsic noise resulted to be diminished, but non dramatically. 
Moreover the extrinsic noise was found to be not promoter-specific, since it resulted correlated 
when the two fluorescent proteins were associated with promoters that are distinctly regulated. 
This leads to hypothesize that the extrinsic noise will cause proteins to be maintained in constant 
relative concentrations. In order to analyze the noise in eukaryotes, the authors use a three-stage 
model similar to the model presented by Paulsson , see Section for details. The authors 
claim the applicability of the three-stage model to both prokaryotes and eukaryotes, the main 
difference being the specific mechanisms of gene regulations. Relative differences in the parame- 
ters can lead to different scenarios which can be biologically interpreted. It can be easily showed 
that two promoters can produce the same average number of mRNAs with different fluctuation 
characteristics in this number: a promoter that undergoes frequent activation processes followed 
by inefficient transcription will show smaller variability with respect to a promoter which has 
rare activation processes followed by stable active state. The authors find three characteristic 
regimes of gene regulation, defined in terms of the rates of the three-stage model and which result 
in different noise profiles. In this paper, extrinsic noise seems to be predominant with respect to 
intrinsic and seems to be of global nature, i.e. it affects the expression of any gene. 

A global analysis of the production of proteins in Saccharomyces cerevisiae was conducted 
independently by Bar et al.[2] and by Newman et al.[48] in 2006. Newman and collaborators 
[48] studied fluctuations in more than 2500 proteins, using the pairing of high-throughput flow 
cytometry and a library of GFP-tagged yeast strains to monitor protein levels at single cell reso- 
lution. This new strategy for large-scale protein abundance measurements allowed the scientists 
to deduce that abundance is the major factor governing protein variation, which most likely 
originates from the stochastic production and destruction of mRNAs. Bar and collaborators 
2] studied 43 different proteins under 11 experimental conditions, founding that the variance is 
roughly proportional to the mean, as predicted by models of stochastic gene expression. Highly 
expressed genes seem differentiate from this trend since their variance appears uncorrelated with 
respect to abundance, as showed also in [48]. The researches point to low-copy mRNA fluctu- 
ations and gene regulation as the main responsible for protein fluctuations, which is consistent 
with the scaling property observed. Moreover, using a dual-reporter diploid strain in similar 
fashion than Elowitz [T6], they show that intrinsic noise contributes substantially to the overall 
protein noise in the case of proteins with intermediate expression level, while it is much smaller 
than extrinsic fluctuations for highly produced proteins. Both works [48] and [2] stress how pro- 
teasome genes are characterized by low noise levels, while stress proteins are very noisy, which 
indicates a precise structure in protein-specific variation and suggests that noise levels have been 
selected to reflect costs and potential benefits. 

The works of Yu et al.[75] and Cai et al.[8] have instead focused on the development of 
techniques allowing real time observations with single cell sensitivity, in order to analyze gene 
expressed at low levels. (-galactosidase is the protein studied in both works, since this is the 
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standard reporter for gene expression both in prokaryotes and eukaryotes. A single molecule 
B-gal can produce a large number of fluorescent product molecules by hydrolysing a synthetic 
fluorogenic substrate, which makes 5-gal an high-sensitivity cellular reporter. However, the 
drawback of its use is the fast diffusion of the fluorescent products which are quickly dispersed. 
Cai and collaborators [8] propose to trap the cells into a microfluidic device: cells are trapped 
into closed microfluidic chambers, such that the fluorescent products expelled from the cells can 
accumulate in the small volume of each chamber. Yu and collaborators [75] suggest another 
technique: they designed a fusion protein consisting of YFP (yellow fluorescent protein) and a 
membrane protein (tsr), slowing down the dispersion of fluorescent material and allowing to take 
measures. Both works were performed on Escherichia coli cells with a target polypeptide ex- 
pressed under repressed conditions. Thanks to the use of single-molecule fluorescence microscopy 
on mRNA [61] [22] [42] and on proteins [75] [8], Taniguchi et al.[70] have performed a quantitative 
system-wide analysis of mRNA and protein expression in individual cells in Eschierichia coli, 
see Figure [1.5] The authors, after normalization to account for cell size and gene copy number 
variation due to cell cycle, have measured protein abundances ranging between 107! and 104 
copies per cell. They found that while the noise scales with protein abundance for low expressed 
proteins < 10, as in [2] [48], this is not the case for proteins produced in higher quantities, where 
noise reaches a plateau suggesting that each protein has at least 30% of variation. They made 
striking real-time measurements of mRNAs, using FISH technique, and proteins at same time on 
137 strains for high expressed proteins, analyzing both mRNA production and mRNA-protein 
correlation. 


Noise and, in particular, intrinsic noise is an obstacle to the genetic program since the stochas- 
ticity of biochemical reactions leads to uncertainty in the resulting amount of proteins which could 
be deleterious to the achievement of the cell program. On the other hand, in specific cases these 
fluctuations positively exploited by cells, as a source of heterogeneity or as fundamental tool of 
decision making. 


Starting with positive effects, fluctuations in gene expression are pointed as a major mech- 
anism to obtain different phenotypes in an identical population. This differentiation can lead 
to the spring of sub-populations which are committed to different responses to environmental 
changes. Cell variability can be boosted in the presence of networks that can produce mutually 
exclusive profiles such as ON and OFF expression of a gene: small variations in the gene ex- 
pression can not cause the switch from one state to the other, but rare and large fluctuations 
can lead to a transition. This is the case, for example, of the lysis - lysogeny decision in lambda 
phage-infected E. Coli [3129 or of the lac operon in E. Coli [521145] or the galactose utilization 
network in yeast [33]. In particular, for the lysis - lysogeny decision the stochastic effects in 
the expression of some regulatory factors could explain the “decision" of cells to take the lysic 
or lysogenic pathway. The reason to choose for a stochastic based decisional network can be 
connected with the performances of the resulting strategy. For example, in the presence of food 
cells can adopt two different strategies: they can sense food in the environment and then activate 
the metabolic machinery or they can stochastically decide to activate the metabolic networks in 
some sub-populations in anticipation of possible food arrival. The first strategy is more effective 
but it can be slow, while the second sacrifices few resources for a quick response. Researchers 
have shown how the stochastic switching strategy could be a good alternative to the sensing 
machinery in the cases in which stochastic fluctuations were more or less synchronized with the 
environment fluctuations [I]. Cellular stress, such as lack of food or exposure to antibiotics, 
is another case where stochastic decision could explain the observed behavior in bacterial pop- 
ulations, as shown in the case of competence in Bacillus subtilis [42] [68]. In particular, it is 
shown how the reduction of fluctuations results in lower percentage of competent cells, reducing 
the chances of the survival of the population under stress conditions. Although the utilization 
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Figure 1.5: Stochasticity in gene expression. Quantitative imaging of a YFP-fusion library. 
Figures[(a)|and are representative fluorescence images of two library strains, with respective 
single-cell protein level histograms fitted to gamma distributions with parameters a and b. 


The cytoplasmic protein Adk uniformly distributed intracellularly. The DNA-binding protein 
YjiE with clear intracellular localization. Adapted from|Taniguchi et al.| {2010} [70]. 


of noise for specific mechanisms, noise in gene expression has to be thought as deleterious for 
organisms and for gene expression in particular, which reveals a robustness with respect to fluc- 
tuations. The genome-wide works of Newman et al.[48] and Bar et al.[2] point how the variability 
is gene specific, in particular stress-genes, which are non essential for cell functioning, show high 
fluctuations, while proteasome genes are much less variables. This allocation of noise indicates 
that different production strategies have been selected and are possibly the result of the tradeoff 
between low level of noise in the protein production and the cost in term of resources of producing 
a large number of proteins at any time. 


1.3 Intrinsic and extrinsic noise 


Biological systems are constituted by individuals interacting in changing environments. In par- 
ticular, fluctuations in gene expression are due to the probabilistic nature of the underlying 
biochemical reactions (“intrinsic noise”) as well as to the effect of environment on this production 
(“extrinsic noise”). The measured fluctuations are therefore the result of the combined effect of 
these two sources of randomness, which lead to a system hardly treatable. Decomposing noise 
into separate terms, even if it does not provide information on the latent mechanisms, it al- 
lows to evaluate models without the obligation to specify simultaneously extrinsic or intrinsic 
mechanisms. 

The concepts of intrinsic and extrinsic noise were introduced by Michael B. Elowitz et al. [16] 
and Swain et al.[69] in 2002. The stochasticity inherent in biochemical processes underlying 
gene expression, such as transcription and translation, is referred to as intrinsic noise, while 
the fluctuation in local environment or in the states of any other cellular factor that affects 
gene expression results in extrinsic noise. We make these definitions more clear by describing 
the simple and ingenuous experimental approach designed by Elowitz et al. to perform this 
separation. The researchers used two equivalent independent gene reporters placed in the same 
cell and observed the two copies simultaneously. Correlations between the outcomes of the 
two reporters reflect the influence of the common environment, extrinsic see Figure [1.6a| while 
differences in their expressions are the consequences of the random microscopic events governing 
each reaction, intrinsic see Figure [L.6b] 

If X denotes the number of proteins of interest in a given cell, we can always write the cell-to- 
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Figure 1.6: Intrinsic and extrinsic noise. Two independent identical promoters marked with 
CFP (green) and YFP (red) controlled by the same regulatory sequences. [(a)] In the absence of 
intrinsic noise, the proteins of the two promoter experiment fluctuate in a synchronous fashion, 
because of changes in environment or on global factors impacting gene expression, i.e. number 
of free ribosomes or gene copy number. This correlated fluctuations lead to a population with 
same amounts of proteins in each cell, even if this amount could change from cell to cell because 
of extrinsic factors. [(b)] When considering the random nature of biochemical reactions the levels 
of the two proteins vary in uncorrelated fashion and produce eventually heterogeneity in the cell 
population. Figures from Elowitz et al.{I6]. 


cell variability o% by conditioning the data on the state Z of the extrinsic variables, i.e. number 
of polymerases, ribosomes, ... Therefore, the cell-to-cell variability can be decomposed as 


ox = (o3z) T o? xiz) ; (1.3.1) 


= or 
unexplained by Z explained by Z 


where we used the notation of [27] and where we used the law of total variance. In particular, 
(7X12) is the variance of the random variable X in the subpopulation characterized with extrinsic 
variables Z and the angular brackets denote the averages over all such subpopulations. The 
term o?z) is the variance of the conditional expectation of X given Z. The decomposition 
(1.3.1) is equivalent to the decomposition in the original theoretical paper of Swain et al.[69]. 
However, conditioning on the state of the environment captures the correct contributions only 
under the case of slow environmental fluctuations, but it is not well suited in the case of dynamic 
environment. 

The main issue of decomposition is that it looks at the environment at a precise point 
in time, but the whole history matters and, to keep track of it, Hilfinger et al.[27] propose a new 
decomposition 
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where the first term in the right hand side is the variance of X; in a subpopulation sharing 
an environmental history Z[0, t] averaged over all possible histories and the second term is the 
variance of the conditional expectation of X; given a history Z[0, t], where t = 0 corresponds to 
the infinite past. In ergodic systems, the term o2,, can be interpreted as the time variation of the 
average, while o2, results the fluctuations around the average. By applying the decomposition 
to the two-promoter reporter, see Figure we obtain that Cov( X,Y) = o2,,, where 
X and Y are the numbers of the two identical and independent reporters in the same cell. 
This brings back the original ideas of Elowitz et al.in [16], where they interpreted the extrinsic 
contribution as the correlation between the two reporters, while the intrinsic noise is seen as 
uncorrelated fluctuations of the reporters under investigation. 
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The terms intrinsic and extrinsic remind immediately to “inside” and “outside” of a specific 
system. Therefore, their meaning depends on the definition of the system and the environment, 
as pointed out by Paulsson and Hilfinger [27|, i.e. if we consider the proteins as a system, 
then the contributions of gene activation/deactivation and of mRNAs on the protein fluctuations 
can be classified as extrinsic, while all these fluctuations are intrinsic as long as we consider them 
in the system under analysis. From a biological viewpoint, this noise classification could be 
done according to the importance of a cellular compound for the gene expression. Nevertheless 
in such a classification, ribosomes should be more likely considered intrinsic, since they are 
essential to gene expression, while gene activation/inactivation may result from the regulation 
machinery making it dependent on the state of the cell. The intrinsic noise could be connected 
to the specificity of a protein with respect to the others. More in detail, the specificity of the 
fluctuations of a protein comes from its birth and death processes and by its average number, 
by its messengers fluctuations and, lastly, by the specificity of the activation/deactivation of 
the relative gene. This last process can be less specific for some proteins, since it happens that 
certain genes have a common regulator which can affect all genes in an operon (see Section [B.2.6] 
for details). Operon regulation may lead to complex correlations in the expression of different 
proteins. Moreover there are fundamental factors of gene expression which are commonly shared 
as ribosomes, polymerases, tRNAs, amino acids, ... For these reasons, when analyzing different 
sources of noise, it is necessary to clearly define the system under analysis. 
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Figure 1.7: Intrinsic and extrinsic contributions to noise with respect to protein concentration. 
Noise for proteins produced in small quantities scales with the inverse of protein abundance, as 
predicted by theoretical models of intrinsic noise, see [54 [57] [71] and results in Chapters [2] and 
However, it reaches a plateau for proteins produced in high quantities, as shown in the plot. 
Figures from Taniguchi et al.[]. 


In 2005 Rosenfeld et al.[66] focused on the dynamic evolution of fluctuations and were able 
to estimate the time scales of both intrinsic and extrinsic noises. Extrinsic fluctuations exercise 
their effect on the cell-cycle time scales, i.e. about 40 minutes, while intrinsic noise intervenes 
on shorter time scales, less than 10 minutes. The slow dynamics and the amplitude of extrinsic 
fluctuations cause the genetic networks to have a memory of about one cell cycle. 


Large scale experiments have investigated the gene expression at large scale [2}|48][70]. Despite 
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differences of the organisms under analysis or quantitative differences in the results, all studies 
show that at low or intermediate expression levels the intrinsic fluctuations are relevant, while 
extrinsic fluctuations are prevalent for highly expressed genes. In particular, when intrinsic noise 
is relevant the noise scales with protein abundance, while it reaches a plateau when extrinsic 
noise is predominant, see Figure [1.7] for details. The extrinsic noise accounts for a large number 
of possible fluctuations and can be divided into global, i.e. events that affect expression of all 
genes, and gene specific, where for example fluctuations of a particular transcription factor affect 
the expression of one specific gene. The large scale experiments point to global effects of noise 
affecting all measured proteins, for this reason the noise is associated to fluctuations in cellular 
components such as polymerases, metabolites and ribosomes. 

In conclusion, the concepts of intrinsic and extrinsic noise are important when dealing with 
experiments, since they permit to separate different sources of noise and allow to analyze in- 
dependently the two components using specific modeling. However, these concepts do not give 
any help in the characterization of the underlying mechanisms and further modeling is needed to 
understand in deep the major steps affecting gene expression both at the level of a single protein 
and of the whole cell. 


1.4 Stochasticity in gene expression: models 


Stochastic models of gene expression have been proposed long before the first experimental 
evidence of the stochasticity of protein production. The works of Rigney and Schieve [64] 
and Berg [5] in the late seventies are the first papers proposing a stochastic modeling of 
gene expression. 

At the time the experimental techniques did not allow to measure quantities in single cell, but 
the approach was to measure the number of proteins by averaging over a subpopulation. This 
approach clearly do not allow to take into account the eventual fluctuations between the individ- 
uals of a bacterial population. On the other hand, few experiments aimed to show fluctuations in 
protein production by indirect means, as done by Novick in [50], and there was a growing 
belief that the gene expression is a complex stochastic process. This theory was corroborated by 
physical reasons such as the fact that the chaotic motion of the molecules should reflect into fluc- 
tuations in protein levels. Moreover, a direct visual evidence of fluctuations was the observation 
of randomly distributed transcription initiations using electron miographs of microbial genetic 
activity. 

Given the previous context Rigney and Schieve proposed two models of gene expression. In 
the simpler one, see [64], the authors consider the stochastic transcription as the main process 
which affects the production of a specific protein. More in detail, they consider a two state 
promoter:“occupied” status, when a polymerase is bounded on the gene, and “free” status oth- 
erwise. They then compute characteristic quantities such as the average time of transcription 
inter-initiation and, more importantly, they compute the average number of mRNAs initiated 
per promoter and its variance at equilibrium. Moreover, they were able to recover few experi- 
mental results, such as the fact that the average number of molecules per cell increases linearly 
proportional to time, and to give a characterisation of the fluctuations around the mean value. 
The description of the processes is Markovian: no matter of the history of the process, the state 
at present is sufficient to determine the future path of our process. Despite the simplicity of 
such model, Rigney and Schieve introduce the main tools of the classic approach. In fact, they 
describe the time evolution of the probability density function via the Kolmogorov backward 
equations, also referred to as master equation, and derive all the statistics of interest by using 
classical tools of renewal theory, see [10] for details. These results can also be derived by using the 
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generating function approach whose main ideas will be recalled in Sections [2.1.3] and [2.A]. The 
more complete model of Rigney [63] is a more detailed model of protein production in bacteria. 
By using the same approach as in |64], the author proposes a complete description including 
protein production and degradation and protein partitioning at cell division, thus proposing a 
population description of protein production under constant conditions. 

In the same period, Berg studied the production of proteins in a microbial population, see [5]. 
The aim of the author was to analyze more in detail the fluctuations of the number of proteins 
at steady state, using a finer description of the process. The lack of single-cell experiments 
made stochastic description the sole mean of investigation of the behavior of individual cells and, 
because of its quantitative character, the characterisation of the fluctuations can bear information 
on some mechanisms in protein production. Berg considered a microbial population at steady 
state with specific doubling time, which is supposed to be deterministic. More in detail the 
author described the dynamics of mRNAs, which, in turn, determine the number of proteins 
produced in the population. Since the specific protein is assumed to be stable, the proteolysis is 
not considered and the only mechanism which prevents an infinite accumulation is the binomial 
partitioning of protein at cell division. The probability distributions of messengers and proteins 
are then formally derived via the master equation and an explicit characterization of the average 
and of variance of protein copy number is given by means of generating functions. All the results 
depend on the cell age in the cell cycle and are valid for synchronously dividing cells. Moreover, 
as for the work of Rigney and Schieve, the assumption of exponentially distributed durations 
lead to a Markovian description of the whole process. One of the main results of the paper is the 
confutation of the Poisson assumption for the produced proteins. In fact, up to that date, it was 
supposed that the protein number follows a Poisson process. Berg showed how the fluctuations 
of the protein number, when considering the transcription step, can be much wider of the Poisson 
distribution, emphasizing the importance of an appropriate description of gene expression. 
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Figure 1.8: Peccoud & Ycart model. The model describes the gene induction, represented as 
a transition between active and inactive states, and the production of proteins. The level of gene 
expression is proportional to the fraction of the regulatory factor in the activated state. Protein 
degradation is considered as well. 


Since the introduction of reliable single-cell experiments in the nineties such as the pioneering 
work of Ko et al.[39], we have assisted to a renewed interest into stochastic models of gene expres- 
sion. In 1995, Peccoud and Ycart [57], inspired by the work of Ko and collaborators, proposed a 
simple Markovian description of gene expression. With respect to Rigney and Berg works, the 
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authors focus on a simple gene regulation mechanism: the gene induction. An inducible gene is 
normally silent, but can be expressed under the control of external signals, the reaction being 
usually mediated by a regulatory factor. The activation of such regulatory factor is reversible 
and is described as the switching between active and inactive states, as shown in figure [1.8] The 
messenger dynamics are not considered, but the level of proteins is assumed to be proportional 
to the time the gene spends in active state. The analysis is restricted to a single cell and it is 
not considered the population dynamics, however the degradation of proteins allows the protein 
number not to explode to infinity as long as we look at equilibrium dynamics. The statistics are 
derived using similar tools as Rigney and Berg and, in particular, they give close formula for 
the transient behavior of the mean number of protein copies and their variance. The asymptotic 
statistics are then derived and a simple method of parameter estimation, using the information 
that can be derived by experiments. 


Paulsson in 2005 [54] gave a review on stochastic gene expression and proposed a model 
which summarizes the main features of the models found in literature. As the works presented 
above, the model aims to characterise the average and fluctuations of the number of copies of a 
specific protein when equilibrium is reached. This model, with respect to the model of Peccoud 
and Ycart presented above, considers also the dynamics of messengers. The described processes 
of gene expression are the activation and deactivation of the gene by means of a repressor, the 
dynamics of the corresponding mRNA and of protein. More in detail the gene can show two 
different states, active and inactive, where the transition from one state to the other is supposed 
to be exponentially distributed. If the gene is in active state, then it produces a new mRNA 
following a Poisson process, which describes as well the mRNA degradation by means of RNase. 
Proteins are translated by mRNAs and are degraded, both processes showing exponentially 
distributed duration. 
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Figure 1.9: Paulsson’s model. In this model gene activation/deactivation is described and is 
represented here via the binding/unbinding of a repressor, in red. When the gene is in active 
state it can undergo transcription and it produces a new mRNA at a certain rate. The mRNA 
can then be translated by ribosomes, represented in blue in the figure, and produces proteins at 
a certain rate. Both mRNAs and proteins can be degraded. Note that the duration of all the 
described steps is exponentially distributed, which allows for a Markovian description of gene 
expression and which is the fundamental common characteristic of classic modeling. 
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1.4.1 Limits of classic models 


In Section we have seen how the encounter of two cellular molecules is the fundamental 
mechanism of the biochemical reactions underlying gene expression. The objective of mathe- 
matical biology is to extract information on such system and on its specific mechanisms, via a 
mathematical description and characterization. However, gene expression comprises a colossal 
number of elementary steps, hence the need to combine those steps into effective steps and focus 
on a part of the system. The intricate interaction pathway linking numerous processes both at 
small and global scale contributes to the complexity of the system. The elementary processes un- 
derlying gene expression are therefore grouped into critical steps, as we have seen in the Section 
where the main steps of transcription and translation are modeled globally as a first-order 
chemical reaction, i.e. they are supposed to be exponentially distributed. 

The models in literature, up to our knowledge, describe the duration of each step to be expo- 
nentially distributed. Nevertheless, the suitability of these assumptions and of the description of 
the gene expression process has been poorly investigated. We recall briefly the main ideas and 
mathematical framework that stand behind classic models and the crucial assumption, that we 
will refer to as “exponential assumption”. 


The classic three stage model and mathematical toolbox 


In the beginning of this section we have described few of the fundamental models used to describe 
gene expression in literature, among which the three-stage model [54] is the reference model for 
experimentalists. All these models share a common underlying description, that can be already 
found in the first systematic and accurate studies of stochastic models for gene expression, 
see Rigney [63] [64] and Berg [5]. In recent years the three-stage model has been used as the 
fundamental structure in most well-known works of Shahrezaei and Swain [67], Paulsson [54] 
and Peccoud and Yeart [57]. 

In these studies the promoter of the gene, corresponding to the specific protein of interest, can 
be in one of two possible states, active or inactive, and the switching occurs up to a exponentially 
distributed random time. Moreover, transcription, translation and the degradation of proteins 
and messengers are supposed to be exponentially distributed (or geometrically distributed in 
case of a discrete time setting). 

The fundamental assumption of exponentially distributed duration of the various steps of the 
three-stage model allows a Markovian modelling. Without loss of generality, we consider the 
one gene case and denote with Y(t) € {0,1} the state of the gene at time t, where Y(t) = 1 
indicates that the gene is active at time t, while Y (t) = 0 if it is inactive. If we denote by No(t) 
the number of mRNAs and by N3(t) the number of proteins, then it turns out that (X(t)) = 
(Y (t), No(t), N3(t)) is a Markov process with values in {0,1} x N? and this representation covers 
most of the models described in the beginning of this section, see Section [2.A] for further details. 

As a consequence of the Markovian description, we obtain a system of linear differential 
equations of order 1, the Fokker-Planck equations, of p(t, (y, n2, n3 )), i.e. the probability that X (t) 
is in state (y, n2, ng) at time t. The solution of the system has a unique stable point (7(y, n2, ng) : 
(y,n2,n3) € {0,1} x N°), the invariant distribution of the Markov process. Although it is not 
possible to obtain an explicit expression of the invariant distribution, it is still possible to get 
information about the moments of the invariant distribution. In particular, since the coefficients 
of the obtained Fokker-Planck equation are affine with respect to the variables n, the moments of 
the invariant distribution satisfy a recurrence equation, giving an explicit expression for the first 
two moments and, in particular, for the variance. This is the main theoretical result that has 
been used in many papers in literature, see [54] [64] [5] (69) [71] [57], and is possible exclusively under 
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the assumption that all the durations of the main steps, such as mRNA and protein production, 
are exponentially distributed. 


Exponential assumption 


We refer to exponential assumption when the time to produce a particular cellular component 
and its lifetime are assumed to be exponentially distributed. 

The exponential assumption is natural in the following simple situation: a large number of 
trials are necessary to achieve some goal (like transcription or translation initiation) and each 
trial requires some duration D and succeeds with probability a. This scheme describes correctly 
the duration of time to establish a binding and, therefore, it may describe properly the time 
required for a successful binding of RNA polymerase to the gene and of ribosome to mRNA. 

On the other side, this assumption may not be true if we consider the elongation time of an 
mRNA or protein chain. In particular, the protein elongation results in an iterative procedure 
in which each codon of the messenger chain is coupled with a particular tRNA, which adds 
a new amino-acid to the growing polypeptide chain by means of ribosome, see Figure [1.10] 
The insertion of a new amino-acid requires firstly the encounter of a charged tRNA and the 
resulting distribution of the elongation step, which is the sum of exponentially distributed random 
variables, is no longer exponential. 


Figure 1.10: Protein elongation. 


Another phenomenon is not described in the classic models: protein dilution. The volume 
growth associated to cell doubling affects the concentration of all cellular component, and in 
particular of proteins, because of dilution. During exponential growth phase, dilution is proven 
to be the main cause of protein disappearing, while proteolysis play a minor role. Few papers 
in literature [43] affirm to consider dilution as the protein degradation mechanism, but its 
description is wrong since it is modeled as a discrete stochastic process, while there is a wide 
agreement on the fact that it is deterministic and continuous. 

Unfortunately, there is no hope to include these two phenomena in the classic mathematical 
framework, since this approach is strongly limited by the exponential assumption, which limits 
the possibility to extend the existing models and to consider possibly more general mechanisms, 
using the knowledge acquired on gene expression through experiments. 


Chapter 2 


MPPP description of gene 
expression 


The study of the fluctuations of the number of a specific protein in a cell is a crucial problem for 
biologists in order to better understand the production of proteins and has already been tackled in 
the past. In particular, researcher have obtained close-form analytic expressions of the mean and 
variance of the number of copies of a protein for a simplified stochastic model of gene expression 
(see Section for further details). A Markovian approach (via Fokker-Planck equations) is 
classically used to derive analytic formulas of mean and variance of the number of proteins 
at equilibrium, under the assumption that the duration of all modeled steps is exponentially 
distributed. This assumption is, however, not completely satisfactory from the modeling point of 
view since the duration of some steps is more likely to be Gaussian, if not quasi deterministic. In 
such a setting, Markovian methods can no longer be used and a new approach allowing for more 
general assumptions is required. This pathway is essential for obtaining a finer characterization 
of the fluctuations of the number of proteins, which is of primary interest to understand the 
general economy of the cell and to analyze the solutions imposed by the evolutionary force. 

The present chapter is devoted to the introduction of a new description of gene expression 
which uses the Marked Poisson Point Processes (MPPP) as the main mathematical tool and 
allows to get rid of the exponential assumption. In the following sections, starting from the de- 
scription of gene expression of the classic three-stage model (see Section[2.A 2}, we will introduce 
the technical tools of the MPPP description and derive the main theoretical results. 


Plan of the Chapter. In Section [2.1] we describe the main biological processes, the effective 
steps of the mathematical model and the limits of the classic models. The mathematical descrip- 
tion of gene expression via MPPP is introduced in Section [2.2] and the main general results are 
derived in Section [2.3] In Section [2.4] we derive results for specific choices of distributions. 

The Appendix [2.A] is devoted to the presentation of few techniques used in classical models. 
In particular, the original approach of Rigney [64] [63] and the consolidated approach used in 
more recent papers [54] [57] are presented. 


2.1 Biology and mathematical assumptions 


We present now the biological mechanisms which have motivated our description and show the 
limitations of the classic approach, when it comes to include more realistic assumptions. In the 
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biological section we will briefly recall the main steps of gene expression, for further details see 


Section or Appendix [B] 


2.1.1 Biological context 


The gene expression is the biological process by which the genetic information contained into the 
DNA of a cell is synthesized into a functional product, the proteins. The production of proteins 
is the most important cellular activity, both for the functional role of its products and the high 
cost in term of resources. 


The information flow from DNA genes to proteins is a fundamental process, common to all 
kinds of cells, and consists of two main elementary steps: transcription and translation. During 
the transcription process, the RNA polymerase binds to an active gene, which corresponds to 
a specific protein, and makes a complementary copy of a specific portion of a DNA strand, 
a messenger RNA (mRNA). Each mRNA, which is a long chain of nucleotides, is a chemical 
“blueprint” for a particular protein. The synthesis of the protein chain from the mRNA is 
achieved by a large and complex molecule, the ribosome during the translation step. More 
in detail, the ribosome binds to the messenger and assembles the polypeptide chain using the 
mRNA as a template: to each mRNA codon, a triplet of nucleotides, corresponds a specific amino 
acid, the fundamental brick of proteins. Once the polypeptide chain has been completed, the 
amino acids fold spontaneously or with the help of chaperons, so that the protein may assume 
its functional three-dimensional structure. 


The gene expression is a highly stochastic process and results from the realization of a very 
large number of elementary stochastic processes of different nature. The thermal excitation 
affects many processes, since it implies for example the free diffusion in the cytoplasm in which 
particles behave basically as if they were plunged into a viscous fluid. In a first approximation of 
the production process, three fundamental mechanisms are combined in the protein production. 
The first is the pairing of two cellular components freely diffusing through the cytoplasm and 
is a direct consequence of the diffusion. The second mechanism is the “spontaneous” rupture 
of such a binding, as the result of thermal excitation, and the subsequent release of the two 
components. The last involved stochastic mechanism corresponds to the processing capability 
of both polymerases and ribosomes and results in an active process, since it requires energy in 
order to take place. The active steps associated to the polymerase and ribosome activity are 
highly sophisticated and include, for example, dedicated proof reading mechanisms. In order 
to proceed to transcription initiation, see Section [1.1.2] is required a successful pairing of the 
polymerase to a specific DNA motif. Once initiation step has succeeded, the elongation step can 
take place and this results as a series of specific stochastic processes, in which the polymerase 
recruits one of the four nucleotides corresponding to the DNA template. In a similar fashion, 
the translation step consists in the pairing/unpairing of ribosome to a specific sequence of the 
mRNA and the subsequent elongation step. In particular, the protein elongation results in an 
iterative energy-consuming procedure in which each codon of the messenger chain is associated 
to a specific tRNA, which adds the corresponding amino acid to the growing polypeptide chain 
by means of ribosome. 


In summary, most of the elementary processes can be schematically seen as the encounter of 
two components in a viscous fluid. However, the classic approach to gene expression modeling is 
to group those elementary processes into critical steps as initiation, elongation and degradation, 
which are common to both transcription and translation. 
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2.1.2 Mathematical model of gene expression: three-stage model 


The main steps of gene expression, as described shortly in the previous section, can be translated 
into a mathematical model, whose main steps are now described. We will not dwell here in 
detail, on the contrary, the aim of this section is to introduce the basic notation that will be used 
throughout the chapter. 

The total number of proteins of the specific type is a random variable P. The cell can be 
seen as a complex system producing a given protein with an average concentration E [P], where 
E [X] denotes the expected value of a random variable X. The protein concentration is directly 
connected with the main parameters describing the different processes and this connection is 
made explicit through a quite simple formula, as will be shown later on. We recall that the main 
objective of the mathematical descriptions of gene expression is to derive explicit formulas of the 
variance of the number of proteins in terms of the main model parameters. 


Notation 2.1.1. If not specified differently, a process is said to occur at some rate À, means 
that the duration of such process is exponentially distributed with parameter X. 


Gene activation. The initiation of 
transcription is strongly regulated by vari- 
ous molecular mechanisms like the associa- 
tion/dissociation of a repressor (see Figure | 
and the association of an activator. The Active gene 
gene is said “inactive” if a repressor prevents 
the polymerase binding and is said “active” 
otherwise. Usually the whole process is de- SD 
scribed as a telegraph process for which a tran- 
sition from inactive state 0 to active state 1 
occurs at rate AT and, similarly from state 1 AF AT 
to state 0 at rate Aj. Here the fundamental 
assumption is that the distribution of these 
steps is exponential. In a prokaryotic cell, 
there may be several copies of a specific gene 
and this fact has been included in few models 
in the past years, see Paulsson [54]. Never- 
theless, since we are interested in the variance 
of the number of proteins, we will assume in 


the following sections that there is only one Figure 2.1: Gene activation. The gene activa- 
copy of the gene. The analogous result for the  ¢ion/deactivation occur at rate Af and À, re 
case with multiple copies is straightforward to spectively. 


obtain since, by independence, the variance of 
protein number is proportional to the number 
of copies of the gene. 

Transcription. A RNA polymerase binds on an active gene in an exponential time with rate 
A2. This effective rate measures the frequency of transcription initiation and takes into account 
several physical parameters, including, for example, the affinity between the specific gene and 
the polymerase. The distribution F> on R+ of the lifetime o2 of a mRNA is assumed to be 
general. 

Translation. Similarly, the binding of a ribosome on an mRNA occurs in an exponentially 
distributed time with rate A3, which measures the frequency of translation initiation and includes 
also the affinity between messenger and ribosome. The distribution F3 of the lifetime o3 of the 
protein is also general. The decay of the protein concentration occurs for two main reasons: 


Inactive gene 
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by proteolysis, i.e. the protein degradation into amino acids, or by cellular dilution, due to the 
cellular volume increase of the bacterium during the exponential growth phase. 


Remark. The analysis on gene expression of Chapters[2|and[Bjis focused on the complex process of 
production of a specific protein. For this reason, the interaction of the specific protein production 
chain with the production process of other proteins is not considered. 


2.1.3 Limits of classic models: the exponential assumption 


In the previous section we have recalled the main steps of gene expression and we have seen how 
the elementary processes in the protein production can be seen schematically as the encounter 
of two cellular components in a viscous fluid. 

The classic approach is to group these elementary processes into critical steps, see Section 
and will be recalled in Section [2.2] In particular, the transcription (resp. translation) step 
is modeled globally as a first-order chemical reaction, i.e. it is supposed to be exponentially 
distributed. The same assumption is usually applied to the messenger and protein degradation. 

Nevertheless, the suitability of these assumptions and of the chosen description of the gene 
expression process has been poorly investigated. In this chapter we focus mainly on the model 
assumptions and develop an alternative description of gene expression that allows to retrieve 
analytic close formulas of mean and variance of proteins in a more general context. 

Before introducing this new mathematical description of gene expression, we recall briefly 
the main ideas and mathematical framework that stands behind classic models and discuss the 
crucial “exponential assumption” that comes with. 


The classic three-stage model and mathematical toolbox 


The three stage model described in Section [2.1.2] is the fundamental approach to describe gene 
expression in literature, as testified by the theoretical research on this model and its large use by 
experimentalists. The key steps of this description can be already found in the first systematic 
and accurate studies of stochastic models for gene expression, as Rigney [64] [63] and Berg [5]. 
In recent years the three-stage model has been used as the fundamental structure in most well- 
known works of Shahrezaei and Swain [67], Paulsson [54] and Peccoud and Ycart [57]. See Figure 
2.2 

The promoter of the gene, corresponding to the specific protein of interest, can be in one of two 
possible states: active or inactive. In these studies transcription, translation and the degradation 
of proteins and messengers are modeled as first-order chemical reactions, i.e. they are supposed to 
be exponentially distributed (or geometrically distributed in case of a discrete time setting). With 
the above notations, this amounts to saying that both o2 and o3 are exponentially distributed. 

The assumption of exponentially distributed duration of the various phases of the three-stage 
model leads naturally to a Markovian modeling. The overall dynamic of gene activation can be 
described, see Paulsson [54], by the random variable Y (t) € {0,1}, where Y(t) = 1 indicates that 
the gene is active at time t, while Y(t) = 0 if it is inactive. Recall that we consider, without 
loss of generality, only the one gene case. If we denote by N2(t) the number of mRNAs and by 
N3(t) the number of proteins, then it turns out that (X(t)) = (Y(t), No(t), N3(t)) is a Markov 
process with values in {0,1} x N?. This representation is common to most of the models of the 
literature. Some of them have, in fact, a lower dimensional state space because of assumptions 
on the number of mRNAs for example. We denote with p(t, (y, n2, na)) the probability that X (t) 
is in state (y, n2,n3) at time t, i.e. 


p(t, (y, n2, n3)) = P [X (t) = (y, na, na) - 
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As a consequence, the general theory of Markov processes gives a system of linear differential 
equations of order 1, the Fokker-Planck equations, for the functions p(t, (y,n2,n3)). The system 
of equations has the general form 


d 
dt (t, (y, n2, n3)) = Ai(y)p(t, (1 — y, n2, n3)) + A2p(t, (y, n2 = 1, na))Lty=1} 
F [2(n2 Ja 1)p(t, (y, na ak 1,n3)) F A3n2p(t, (y, n2, n3 = 1)) 


+ us(ns + 1)p(t, (y, n2, n3 +1)), (2.1.1) 


where à (0) = Ay and à1(1) = Af, ui = 1/E(c;) for i = 1, 2. The solution of the system has 
a unique stable point (r(y,n2,n3),(y,n2,n3) € {0,1} x N?), the invariant distribution of the 
Markov process, whose explicit expression is not known to the best of our knowledge. Neverthe- 
less, since the coefficients of the right-hand-side of this equation are affine with respect to the 
number of messengers n2 and the number of proteins n3, the moments of the invariant distribu- 
tion satisfy a recurrence equation. This equation is not trivial, but gives an explicit expression 
for the first two moments and, in particular, for the variance, which is the key quantity to in- 
vestigate fluctuations. This is the main theoretical result that has been used in many papers in 
literature, see [54] [64] [5] [69] [71]. 

It should be kept in mind that this approach is possible only under the assumption that 
all the durations of the main steps (like the production time of an mRNA or of a protein) are 
exponentially distributed. This assumption is now discussed. 


Exponential assumption 


We refer to exponential assumption when the time to produce a particular cellular component 
and its lifetime are assumed to be exponentially distributed. 

We discuss now the appropriateness of the use of the exponential assumption, with respect 
to the biophysical process described. The exponential assumption is natural in the following 
simple situation: a large number of trials are necessary to achieve some goal (like transcription 
or translation initiation) and each trial requires some duration D and succeeds with probability 
a. If Ga is the total number of attempts to succeed, ie. P(G, > n) = (1 — a)”, then 


lim P(aG, > xz) =e” for x > 0. 

a—0 
In other words, if œ is small then aG, ~ E, where E is an exponential random variable with 
mean 1. Consequently, the total duration of time necessary to realize the objective is, due to the 
averaging of the law of large numbers (Gq is large), 


[D] 


a 


FE 


D Di ~ GaE [D] ~ 


and is therefore exponentially distributed with mean E [D] /a. 

As is seen, this scheme may describe correctly the duration of time to establish a binding of 
a polymerase or of a ribosome. More in detail, it may describe properly the time required for a 
successful binding of RNA polymerase to the gene and of ribosome to mRNA. 

On the other side, this assumption may not be true if we consider the elongation time of 
an mRNA or protein chain. In particular, during the polypeptide chain elongation, each tRNA, 
transporting a specific amino acid (see for details), should bind to the ribosome. If 
the distribution of the duration of this step is indeed exponential, nevertheless the fact that 
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elongation steps requires an average number of 300 — 400 steps, one for each amino acid, then 
the resulting distribution of the duration of the whole process is no longer exponential. In first 
approximation, because of the large number of elongation steps, a deterministic elongation time 
with a small Gaussian perturbation should be considered. 

The description of elongation step and the impact of different choices for the distribution of 
the duration of this step will be analyzed in detail in Chapter[3| In the present chapter we assume 
that both messenger and protein are instantaneously produced, i.e. the duration of elongation is 
zero. However, one of the main results of this chapter is to show, via convenient mathematical 
tools, that the choice of the distributions has an important impact on the quantification of protein 
fluctuations and, therefore, on the qualitative properties of the gene expression. 


2.2 MPPP Description of Gene Expression 


In this section, the various stochastic processes are introduced. The main results and notations 
concerning the marked Poisson point processes (MPPP) is introduced in Section In the 
entire manuscript, all the Poisson point process and the associated random variables are supposed 
to be independent. 


Gene activation 


It is assumed that there is one active gene, which is activated at rate Af and inactivated at 
rate A; . Recall that the assumption that nmax, the maximum number of active genes, is 1 does 
not restrict the generality of our results since the quantities analyzed in this paper (expected 
values and variances) are proportional to Nmax. Let (En) and (Fn) be i.i.d. exponential random 
variables with respective rates À; and AT. The process of activation of the gene at equilibrium 
can be represented as a stationary process (Y (t),t € R) with values in {0,1}. Note that (Y (t)) is 
defined on the whole real line, i.e. that the activation/inactivation process has started at t = —oo. 
As it will be seen, this is a convenient representation to describe properly the equilibrium of the 
protein production process. The increasing sequence of the instants of activation of the gene is 
denoted by (tn, n € Z) with the convention that to < 0 < tı. In particular 


{in,n € Z} = {s € R: Y(s—) = 0 and Y (s) = 1} 


and tn+1 — tn = En + Fn. Because of our assumption (tn) is a stationary renewal point process. 

Since we are interested to study the system at equilibrium and since (Y (t)) is defined on R, 
we can suppose that the process Y(t) started at —oo and is at equilibrium at the instant 0. So, 
instead of start the processes at time 0 and suppose they reach equilibrium after an infinite time, 
we may suppose that the production machinery has started at —oo and it is then at equilibrium 
at time 0. This convention will be used for all the processes investigated. 


Production of mRNAs 


When the gene is active, it produces mRNAs at rate A2 > 0 and F2(dy) denotes the distribution 
on Rx of the lifetime of an mRNA. Let N,,=((5:,02,),n € Z) be a MPPP on R x R4} with 
intensity measure Az dr ® F (dy). 

If the gene is always active throughout the time interval [s, t], then the formula 


Na, ([s, t] x R) n 5 L{s<s, <t} = | 


Lis<u<t} N (du, dv) 
nez EAR 
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Figure 2.2: Three-Stage model. The gene activation/deactivation occur at rate Af and Ay 
respectively. Transcription and translation occur at rates À2 and Ag respectively. The degradation 
times of mRNAs and proteins have probability distributions F2(dt) and F3(dt) respectively. 


represents the total number of mRNAs created between time s and time t, and 


>, Lrscscicsntorn} = | L{scucteu+o}/N), (du, dv) 


nez RXR} 


is the number of mRNAs still alive at time t. More generally, if we include the gene dynamics 
into the formula, we find that the number of messengers created in the time interval [s, t] and 
still alive at time t is 


Ss Vy <n <t<5n+02.n:¥(8n)=1} = V{s<u<t<utv,Y (u)=1} N^ (du, dv). 
nez RE 


Production of Proteins 


A given mRNA produces proteins at rate A3 and F3(dy) is the distribution of the duration of 
the lifetime of a protein. 

For u € R, denote by NX, a MPPP with intensity À3 dx @ F3(dy), this process describes in 
the following the creation of proteins associated to a mRNA created at time u. We assume that 


Wi U2 


x, and Ny? are independent for u1 # u2. In particular, if mRNA lifetime is v then 
Nyut x Ry) = f NY, (de, dy) 
[u,u+v]xR+ 


is the total number of proteins created by such an mRNA during its lifetime. 


Remark. If the gene is always active, i.e. Y(t) = 1, the whole process of production of mRNAs 
and proteins under this specific assumption can thus be described by the sequence 


A= (Sn) Fan, NX") : 
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Recall that NY, : (Q, F, P) + M,(R x R+), where M,(R x R+) is the set of point processes 
on R x R4. If we denote with Q the distribution of NY, on M,(R x R+), the process A can 
be seen as a marked Poisson point process on R x Ry x M,(R x R+) with intensity measure 
A3 dz @F3(dy)@Q. This observation will not be used in the following to keep the setting as simple 
as possible but the proof of Proposition [2.3.5] below could be shortened by using it together with 


Proposition 


The notations with some definitions for the stochastic models used in this paper are now 
summarized. 


Notation 2.2.1. 


e Gene activation. 
The activation rate is [resp. inactivation rate] is AŤ [resp. Ay] and 


at ae 
64 = and A= Af HAT- 


e mRNA production. 
The rate of production of mRNAs by an active gene is A2, F2(dx) is the distribution of an 
mRNA lifetime, o2 denotes a random variable with distribution F> and 


P2 a À2 i [o2] = x f uF (dx). 
Ry 


e Protein production. 
The rate of production of proteins by an mRNA is A3, the lifetime distribution of a protein 
is F3(dx), 73 denotes a random variable with distribution F3 and 


ps = AE [o3] = x f zF3(dr). 
R+ 


2.3 General results of MPPP description of gene expression 


In the previous section, using basic definitions about Poisson point processes, we have given a 
first description of the main processes modeled in the three-stage model. In this section, using 
the MPPP description, we derive general results concerning gene expression. In particular, the 
number of the different processes at equilibrium and general formulas for the mean and the 
variance of each process. We will not discuss here the results for specific assumptions of the 
general distributions and we refer to the following section. 


2.3.1 Gene state 


The behavior of the process (Y (t)) describing the state of the gene is well known in literature. 
In particular, Y(t) € {0,1} and is Bernoulli distributed at equilibrium. In order to obtain the 
stationary distribution wy of the process Y it is sufficient to write the detailed balance equation, 
since (Y(t)) is a reversible Markov process. Therefore, 


Af my (0) = Ap ay (1) 
ty (0) + ty (1) = 1, 
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where wy(0) = P[Y = 0] and zy(1) = P[Y = 1]. Solving the previous system we obtain, at 

equilibrium, 

AF 
1 


Poeme 
DT FA; 


=1-P(Y(0) = 0). 
From now on, it will be assumed that (Y(t)) is defined on R and is at equilibrium. 

In order to obtain analytic expression of the number of messengers and proteins, we need also 
to compute the covariance of the process (Y (t)). For t > 0, 


P(Y (t) = 1]Y (0) = 1) = 84 +(1—d,)e“*, (2.3.1) 


with A = Af + AĮ, in fact, since the process (Y (t)) is stationary, then 


E [Y (u)¥(s)] = E [Y (lu s)Y (0)1Y (0) = 1] P [Y (0) = 1] 
= P [Y (lu — s|) = 1Y (0) = 1] P [Y (0) = 1] 


and we obtain the previous formula by solving the Kolmogorov forward equations, see Note [2.3.1] 
See Norris [49] and Peccoud and Ycart [57] for detailed computations. 


Note 2.3.1. The generator matrix Qy of the two-state continuous time Markov chain Y(t) € 
{0,1} is given by 
Qy = ee ar 
AY AY 


Denote with pij(t) = P(Y (t) = j|Y (0) = à), with i,j € {0,1}. Recall that we want to compute 
pult) = P(Y (t) = 1]Y(0) = 1). The Kolmogorov forward equation P'(t) = P(t)Qy, where 
P(t) = (pij(t))ijetoy, together with the identity por = 1 — pri gives 


pult) = ÀŸ — AT +AT )pu(t), Vi > 0 
pi1(0) = 1, 


whose solution is formula (2.3.1). 


2.3.2 Messengers 
Number of mRNAs 


A result on the number of mRNAs at equilibrium and its distribution is derived in this section. 
The techniques used to prove it will also be used to investigate the distribution of the number 
of proteins in the next section. In order to present the MPPP approach, we will develop compu- 
tations for mRNAs. They are simpler from the point of view of notations and they include the 
main ideas. 


Proposition 2.3.2. The number M of mRNA’s at equilibrium can be represented as 


M= L{u<o<u-+u,Y(u)=1} Mas (du, dv), (2.3.2) 
RXR+ 


where Ny, is a Poisson marked point process with intensity A2 dz @ F(dy). 
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Proof. Let M; denote the number of messengers born after time t = 0 and still alive at time t. 
Then M; is given by 


+00 
M, = > Liocscicontor nY(en)=1} = Í f liu<t<u+v,Y (u)=1} My (du, dv), (2.3.3) 
n + 


where Mya = (5n,02n), see Section Recall that sn is the (potential) nt! binding time of a 
polymerase on the gene: an mRNA is created only if the gene is active, i.e. Y (sn) = 1. The term 
O2 represents the lifetime of the newly produced mRNA. The right-hand-side of the previous 
equation accounts for the number of mRNAs produced in the interval [0, t] and still alive at time 
t (u+v >t). 

Since the process (Y (t)) is stationary as well as the Poisson marked point process, they are 
both invariant by translation. By translating by —t, one gets that M, has the same distribution 
as 


0 
dist. 
M a | L{0<u+0,Y (u)=1} Maa (du, du), 
-t JR} 


by letting t go to infinity, we obtain the desired result. 


It is crucial that the distribution of M, can be explicitly expressed as a functional of the 
marked Poisson process W,,. The same property is true for its limit. In this context, with 
the help of the coupling argument, there is no need of a Markovian setting to prove that M, 
converges in distribution as t goes to infinity. As will be seen, the distribution of the limit M 
can be obtained by using some properties of Poisson point processes. For all these reasons, there 
is no need to assume 02 and a3 are exponentially distributed. 

In the proof of the above result, we have in fact proved a more general following result. The 
point process M representing the instants of creation of mRNAs and the associated lifetime at 
equilibrium can be represented as 


M= LY (u)=1}9(u,v) Nr. (du, dv), (2.3.4) 
RXR+ 


where 6, is the Dirac mass at -H 
The number of mRNAs alive at equilibrium can thus be represented as 


M= f AucocurM (dude) = frucocurertn Ni (du de) 


which is precisely the expression of Proposition [2.3.2] When the activation rate of the gene goes 
to infinity, the point process M is simply a marked Poisson point process and M has a Poisson 
distribution with parameter p2 = A2E(o2). 

We now use this representation to get an explicit expression of the variance of the number of 
mRNAs at equilibrium. 


lif f : R x R} —> R is integrable with respect to the measure M(du, dv), then 


Mf) © i a, Huse) Mau, de) = f a, Mon Ale v) Nya (du du) 


The point process M is identified by the sequence ((rn,02,n),n € Z), where rn are the times of messenger birth 


conditionally to the gene activation and o2,n the corresponding lifetimes. For this reason the point process M 
can be represented also as M = 5, (rm, ) hence the expression 234. 


C2,n 
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Proposition 2.3.3. If the distribution of the lifetime of a mRNA is F (dx), the average of the 
number M of mRNAs at equilibrium is given by 


E(M) = 64p2 = 0 A2E(02). 


The variance of M is 


+00 +00 
var(M) = E(M) + 29564 (1 — a) | f e “’F,(u)Fo(u+v) du du (2.3.5) 


where F2(x) = Fo([0,a]) and Fo(x) = (1 — Fo(a))/E(o2), A = A] + À and 6, =AP/A. 


The formula for E(M) is in fact quite intuitive: 0} À2 is the production rate of mRNAs and 
E(o2) is their mean lifetime. 


Proof. Conditionally on the process (Y(t)), M follows a Poisson distribution, hence for z € [0,1], 


0 
2 (2™ | (Y (t))) = exp (ne i L. Liu<o<cu+v,Y (u)=1} du raw) 


0 
= exp (-x0-2 | Liy (u)=1}P(o2 = —u) du) (2.3.6) 
by taking f(u,v) = —log(z)l{yY(u)=1,u<0,u+v>0} in Relation (A.0.1). If we differentiate for- 


mula (2.3.6) with respect to z and take z = 1, we obtain 


0 
L{y(u)=1}P(02 > —u) du. 


E (M | (Y(t) = X2 I 


Since (Y (t)) is at equilibrium, P(Y (u) = 1) = 6}, hence integrating the last relation gives 
0 
E(M) z i | P(a2 2 —u) du = 04 Ag 


— CO 


Ga 


(02). 


If we differentiate twice Formula (2.3.6) and substitute z = 1, we obtain 


2 


+00 
(MM — 1) | (Y (0) = X ( [T tivenn du) 


= Bf Lpy(—u)=1,Y( v)=1}P (02 > u)P(o2 > v) du dv, 
R 


+ 


which, integrated with respect to (Y (t)), gives 


E (M°) - E(M) = x | P(Y(—u) = 1,Y (—v) = 1)P(o2 > u,02 > v) du dv, 
R 
where the random variable 53 is independent of oz and has the same distribution. Using rela- 
tion (2.3.1), for u < v and A = Af +27, we get 


P(Y (—u) = 1, Y (—v) = 1) = P(Y (—v) = 1)P(Y (—u) = 1 | Y (—v) = 1) 
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Therefore E(M?) — E(M) is the sum of 


xa | P(o2 > u, oz > v) du du = (120: E(02))? = (E(M))? 
R2 


+ 


and, up to the multiplicative factor 2136, (1 — 64), of 


1 P(o2 > u,02 > ide C0 Hs du dv. 


+ 


The proposition is proved. 


Relative variance 


The relative variance of the number M of mRNAs at equilibrium, see (2.3.2), is defined as 


aa = in ! gat Ir, (2.3.7) 


by relation (2.3.5), with 


+00 
An f e F,,(u)Fo(u-+v) du dv, 
0 


where A = Aj + Ay. When the mean E(M) is fixed, Ip, is the only quantity which depends on 
the distribution of the lifetime of an mRNA. 

To conclude this section, we now apply the previous general formulas to specific choices 
of the probability distribution. In particular, we will get an analytic formula of the previous 
for exponential and deterministic distributions. These assumptions are not completely realistic 
from a biological point of view; nevertheless they are used to stress the impact of probability 
distribution on the messenger variance. If the distribution of the lifetime of an mRNA is the 
exponential distribution E,, with parameter 42, one gets 


1 


I = —____, 
Bua — Quo(A + p2) 


If the lifetime of an mRNA is the deterministic distribution D,, with a unit mass at 1/42, the 


above formula yields 
1 A 
Ip, a (ee a 
Dug A2 (« T =| 


Straightforward calculations with these formulas show that I Ens <I Dyz The ratio I Dp /I Ens 
varies in fact between 1 and 2, see Figure The variance for the exponential distribution is 
smaller than the one for the deterministic distribution with the same mean. This result is not 
quite intuitive if one takes into account that the variance of the exponential distribution is quite 
large. 


2.3. GENERAL RESULTS 33 


Ratio In, /TE,,, 
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Figure 2.3: Variance of M: Comparison of Ip,, and 1#,, defined by (2.3.7). Deterministic versus 
Exponential. 


2.3.3 Proteins 


Recall that if an mRNA is created at time u and has a lifetime v, then on the time interval 
|u, u + v] proteins are created according to the marked Poisson point process M; x, With intensity 
A3 dz ® F3(dy). Denote with (s,,n € Z) the sequence of mRNA births, the point processes 
N re are supposed to be independent. The instants of creation of proteins together with their 
lifetimes can thus be represented by the following point process 


P= M(du, av) f 5 (x,y) Nx, (da, dy), (2.3.8) 
[u,u+v]xR+ 


RxR, 


where M is the point process defined by formula (2.3.2). 


Proposition 2.3.4. The number P of proteins at equilibrium can be represented by the random 
variable 


p= | Live Naa du, dv) f Lizr<o<r+y,u<e<u+o} Ny (dx, dy). (2.3.9) 
RxR4 R 


XR+ 


Proof. The derivation is quite straightforward. If an mRNA alive between time u and u + v 
generates a protein at time x with lifetime y, this protein will be present at time 0 if x < 0 < x+y. 
The argument that this is indeed the representation of the number of proteins at equilibrium 
follows the same lines of the proof of Proposition [2.3.2] 


Theorem 2.3.5. If the distribution of the lifetime of a mRNA fresp. protein] is F>(dx) fresp. 
F3(dy)/, then the expected value of the random variable P, which is the number of proteins at 
equilibrium, is given by 


(P) = 64923 = f+ 2\3E(02)E(03) 
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and its variance var(P) can be expressed as 


+oo (—stt)A _ 2 
var(P) = ) + P) taak | f | EA du 
Ry —s 


2 
+ p3p364(1 — ix) f e AG ue) (2) TT (tg) F3 (ui) dui dei, 


4 
R4 i=1 


ds F(dt) (2.3.10) 


where, for j = 2, 3, Fj(x) = F;([0,x]) and Fj(x) = (1 — Fj(x))/E(o;), A = Af +A and 
ô4 = AT À 


The expression of E(P) can also be understood intuitively. Recall that E(M) = 64A2E(o2) 
is the mean number of mRNAs, A3E(M) is therefore the production rate of proteins and E( 
is their mean lifetime. 


Proof. We start with the simple case of the mean. For fixed u, v € R+, formula (A.0.3) gives 


u<x<u+v u<t<u+v 


E p l pecoce+y, itera = x f Lya<o<saty,, dr F3(dy) 
RxRy RXR+ 


Integrating this expression with respect to 1ty(u)=1} Nj, (du, dv) and taking its expectation, we 
get 


EIP | Y ©) = 

= AE p Lpy(u)=1} E TRE 2<0<z+y,1 Vx, (dx, dy) 
RxRy RXR+ u<x<u+v 

= ÀE | Liy (u)=1} / l fe<o<ety, dz F3(dy) Ny, (du, dv) 
RXR+ RxR} lu<a<u+v 


= x | L{y(u)=1} ji l fe<o<ety, dz F3(dy) 
RxR4 RxR} lu<a<utv 


= ors f Ly (uta)=1} L{x<o} L{u<o}P (02 > —-u)P(o3 > —x) dx du, 
R_xR_ 


(NM. ), (Y(t) Maa (du, dv) 


(Y(¢)) 


(¥ (t)) 


À2 du Fy (dv) 


where we have used again formula (A.0.3). A further integration gives finally the expectation 


Recall that My, can also be represented as Ny, = (5»,02n) and 


P-> 


nez” RXR+ 


L{y(s,) ul 2<0<x+y, Wine (de, dy). 


3 
Sn <T<Sn+02,n 
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Denote by E the conditional expectation il: | (Y(t)), (Sn, o2,n)]. The conditional generating 


function E LP ] can be written as 


F (11 exp (- log(2) | HY (n=) ly 2<0<x+y, Wis (dx, 1) 
RXR+ 


nez Sn LT<Sn+02,n 


= II R fe (- log(e) | MY (snjit g æ<O0<r+y, ja du) 5 
RXR+ 


nez Sn <T<Sn+02,n 


since the point processes M; eae n € Z, are independent. 
The nth term of this product is, applying Proposition to the marked Poisson point 


Sn 


processes Ny? , 


exp ET = Zz) gy(s,)=1} I. "i æ<O0O<x+y, } co Fs) à 
+ 


Sn LT£LSn+02,n 


ED) 


(2? ) with respect to Ny,, the generating function can thus be written as 


E (2? |(Y(t))| =E fe (- f g(u,v NW; an) ; 
RXR+ 


g(u,v) = As(1 — Det f Lya<o<r+y,y dr F3(dy). 
R 


XR+ u<xr<u+v 


Applying again Proposition [A.0.5]to the marked Poisson point process Ny,, we get 


E (2°1(Y(E))) = 


By integrating 


where 


u<O<u+v, 


= exp -M [au Fy(dv) | 1— exp -a1 =) f L fe<o<s+y,) dr F3(dy) 
R Ry RxR} { } 
Y (u+x)=1 


In order to obtain an expression for E(P(P — 1)|(Y (t))), we have to differentiate twice the 
previous formula with respect to z and evaluate it at z = 1. The resulting formula should then 
be integrated with respect to (Y (t)) and we can get formula (2.3.10), by using similar arguments 
as in the proof of Proposition [2.3.3] (with more technical calculations). 


2.4 Results of MPPP description of Three-Stage model 


To show the effectiveness of the analytic formula of the protein variance, we consider the 
cases of exponential and deterministic distributions. More realistic distributions are considered 
in Figure and [2.5] This specific analysis will give insights of the impact of the distribution 
choice on the protein variance. 


Explicit formulas of protein variance 


In each case the average lifetime of an mRNA [resp. protein] is 1/u2 [resp. 1/u3]. Recall that 
6, = à] /A and A= Af +7. As for the case of mRNAs above, even if from a biological point 
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of view these assumptions are not completely realistic, this analysis shows the impact of the 
distribution on the variance, and therefore of the necessity of having closed form expressions for 
a large set of distributions. 


Exponential Distribution. If the distribution of the lifetime of an mRNA [resp. protein] is 
exponential with parameter u2 [resp. 43], then formula (2.3.10) gives the classical result on the 
variance, see Paulsson [54], 


var ®) (P) = 


> À3 A2A3(1 — 8+) (A + p2 + u3) 
a) (: eae ne E me) | een) 


Deterministic Case. If the lifetime of an mRNA is exponentially distributed with parameter 
jig and the protein lifetime is deterministic, equal to 1/3, then formula (2.3.10) gives the identity 


var(D) (P) = E(P) Í 42% (1 a emis) 
H2 H2 
_ 2A2A\3(1 — 64) M2 | H3 f er Mus] 
= A-k M 
H3 —u2/ 1 1 
Ali halha) gN Rs 2.4.2 
o Jaja] e 
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Figure 2.4: Square root of relative variance of the number of proteins with a fixed mean. These 


curves are obtained using the analytic formulas (2.4.1) and (2.4.2). 


Numerical analysis: a counter-intuitive result 


Relation (2.3.10) gives an explicit, but intricate expression for the variance, we present some 
numerical experiments based on this formula. Figures and consider the case when 


the average number of proteins at equilibrium is fixed and equal to 300, that À2 = 0.02sec.~!, 
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I =0.01sec.~' and that the average of the lifetime of an mRNA [resp. protein] is 172sec. [resp. 
1000sec.|. We have considered several possible choices for the distribution F3, it is assumed that 
all the other distributions are exponential. When the distribution F3 is Gaussian, S denotes its 
variance. 
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Figure 2.5: Ratio of the square root of relative variances of the number of proteins with de- 
terministic and exponential lifetimes respectively. Each curve corresponds to different level of 
protein expression and is obtained using the analytic formulas and (2.4.2). Note that the 
mean is fixed along each curve. 
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Figure 2.6: Square root of relative variance of the number of proteins with a fixed mean. The 
curves corresponding to Gaussian distribution are indexed with respect to the parameter S = 
on /E(N), where E(N) is the average of the Gaussian lifetime and oy its standard deviation. 
The curves are obtained via Monte Carlo simulations and the statistics are obtained numerically. 
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The counter-intuitive result studied at the end of Section [2.3.2]is now observed in the case 
of proteins. The exponential distribution has large variance, in fact if X is exponentially dis- 
tributed with parameter \, then var(X) = E?(X), while a deterministic lifetime has obviously 
no fluctuations. Surprisingly, by replacing a noisy exponentially distributed protein lifetime with 
a deterministic one, the fluctuations of the protein number result increased. This unexpected 
result is valid for any choice of parameters and is shown in figure [2.4] 

If we consider a Gaussian distribution for the protein lifetime, the curves are comprised in 
between the exponential and deterministic cases. Moreover, the profiles of variance correspond- 
ing to a normal distribution show a precise behavior: the profile corresponding to a Gaussian 
random variable with small variance is close to the deterministic curve and we obtain a monotone 
decreasing sequence of curves as long as we increase the variance. These results are shown in 


both figures [2.6] and 
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2.A Appendix: survey of few classic models of gene expres- 
sion 


A systematic and accurate use of stochastic models to investigate gene expression can be traced 
since the late 70s with the models proposed by David R. Rigney in the late 70s ({64], [63]) 
and by Otto G. Berg [5]. These models are still key references for the models of stochastic 
gene expression, since, despite the many simplifications and assumptions, the authors develop a 
complete mathematical analysis of their model, proposing analytic formulas for the first moments 
of protein numbers in a single bacterium and within a population. Both works predicted mRNA 
and proteins fluctuations taking into account deterministic cell growth, chromosome replication 
at fixed time and partitioning of proteins between daughter cells. 

These pioneering works and the techniques they used, were brushed up in the 90s by the 
works of Peccoud and Ycart and McAdams and Arkin [43]. In the last paper the authors 
model gene expression using a stochastic formulation of chemical kinetics derived by Gillespie 
[19], predicting noticeable variability in protein numbers between individual cells. This was the 
beginning of an intense research and studies on stochastic models of gene expression (Paulsson 
et alii [55] (53) 54), Swain et alii [69] [67], ...). 

In this section we present some of the most relevant models that can be found in the literature 
that deal with stochastic modeling of gene expression in bacteria, concentrating on the models 
which have been taken as a first reference for our work and serve as introduction to the classic 
techniques used. 


2.A.1 The Rigney’s model 


We focus mainly on the model of protein production in a single cell, see [64]. We will discuss 
briefly the model at population level, which considers new features such as as cell age and cell 
division. 


Introduction 


Until recent years, experiments measured in vivo protein synthesis as an average over the sub- 
populations of cells in each sample. However, few experiments in he early seventies observed 
that the average number of molecules for certain enzymes increases linearly with respect to time, 
during a cell cycle. When the gene is duplicated this behavior is still present, but it proceeds 
with twice the previous rate. 

These observations served as starting point of the modeling work of Rigney and Schieve, 
which, up to our knowledge, were among the first to introduce a probabilistic description of the 
synthesis of each protein specie within a single bacterium. There were in fact physical reasons to 
believe that chaotic motion of molecules manifests in form of fluctuations and also direct a visual 
evidence: the times between successive mRNA transcription initiations, by the DNA-directed- 
RNA polymerase at given promoter, are randomly distributed, as was shown by Miller, Hamkalo 
and Thomas in 1970. The subsequent transcription and translation events were also thought to 
be random processes and a stochastic description of the phenomenon seemed natural. 


The model 


The model described in [64] should predict on one side the linearity in the average production 
rate, on the other side it should allow individual cells to show fluctuations in their individual 
rates of synthesis. 
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The rate at which each protein species is produced in each cell is supposed to be principally 
determined by the random frequency at which polymerase molecules bind to the corresponding 
promoter site on the DNA. 

The promoter region corresponding to a specific protein may be in one of two possible states: 
either a polymerase molecule is bound to the promoter (state “on”), or it is not (state “off”). 

This model does not considered at all the messenger degradation, nor the bacterial dilution. 
The interest is focused mainly on the transition in which a bound polymerase molecule leaves 
the promoter region to begin the production of mRNA strand. We denote with N; the number of 
times this particular promoter becomes free over an arbitrary period of time t and it is assumed 


N; x (produced enzymes because of this promoter), 


i.e. the number of transitions is proportional to the number of enzyme molecules eventually 
produced. This allows to transfer directly the analysis on the promoter to the protein level. 

The transitions allowed by the model may be described as a Markov jump process. With 
few calculations is possible to compute the probability distribution for the time between suc- 
cessive mRNA initiations at a specific promoter. This distribution can be used to calculate the 
probability of certain mRNA initiations over the time period t. 

Suppose that cells have been growing under constant conditions for several generations, then 
the fraction of promoters in each of the states of the model will not change in time, that means 
that the random process has reached a stationary state (there are changes at level of single 
promoters but not at population level). 

The model is intended to quantify the number of mRNA of a particular protein type initiated 
on the ensemble of promoters over a period t. The problem can be stated in the context of 
the renewal theory, where the holding times are here the time of detachment plus the time of 
attachment of the promoter, and where the renewal process is represented by the number of 
mRNA initiations N;. The standard renewal theory tells us that the average number of mRNAs 
initiated on the promoters over a period t is linearly proportional to time. Since the number of 
proteins is assumed to be proportional to initiated mRNA, the protein number should be linearly 
proportional to time. These results are based on the assumption that the transition parameters 
do not change during the cell cycle. 

Here we consider the two-states model as in [64], but we make direct calculations and we 
slightly change notation with respect to the original paper. 

Suppose that the promoter region of a particular gene has two possible states: state “0” the 
promoter is free, state “1” the promoter is associated with a polymerase. The transition from 1 
to 0 correspond to a mRNA initiation. 

Be X; € {0,1} a stochastic process such that the time Fy, it stays in state “0” is exponentially 
distributed with parameter Ao, while the time it stays in state “1” is exponentially distributed 
with parameter A; and is denoted by E\,. The process X; is a continuous time Markov chain 
with state space J = {0,1} and with Q—matrix 


_f -^o Ao 
tne 
The Kolmogorov forward equations are 


P| = PQ 
Po(i, j) = dij: 
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which admit the solutions 


P,(1,1) = = oo + — = (2.4.1) 

P,(0,0) = en ae al k (2.A.2) 

P,(0,1) = =e at ae = rs (2.1.3) 

P,(1,0) = = an ta a x (2.1.4) 
The stationary distribution is therefore given by 7(0) = sai m1) = iS 


In order to evaluate the number of produced mRNAs we have to compute the time interval 
between two subsequent messenger initiation. Denote with T the random variable 


T =F), +E. (2.A.5) 


To compute the probability density function (p.d.f.) of T, we use the Laplace transform 
as in the original paper of Rigney. Since Fy, and Æ\, are independent, and since the Laplace 
transform of a random variable exponentially distributed with parameter A; is given by 


Ài 


£(Ba,)(2) = E [~] = so, (2.A.6) 
then 
L(T)(2) = E [e7] = E [eat] 
= E [e72] E [eE] = AoA1 (2.A.7) 


(Ao + 2)(A1 +2) 


The previous formula can be written as a sum of the type £(T)(z) = Aoz + Ay xh: with 


`i Xo 
= A = . 
A= Ne RE 


Using formula (2.1.6), we have that the p.d.f of the random variable T is given by 


Ao 


fr(x) = Ange 7>? + Are te = AE (eos = a), (2.4.8) 
17 AO 


Using the previous formula we are able to compute the average inter-initiation time, which is 


RS Ao + À 
ET] = | moda Ea, (2.A.9) 

0 AoA1 
It is possible to relate the distribution of the process N; with the time T between two occur- 
rences. In fact if we denote with T;, i = 1,2..., the time between the (i — 1)" and i™ event, 

and with > 
S= T; (2.A.10) 
i=1 


the total time up to the r*e occurrence, then we have the identity 


{Ne <r} ={S; > th. 
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It is possible to derive then the statistics of the process N; and it results that the average 
number of initiated mRNAs over the interval (0, t] is 


i t No + À 
EIN] = Ti = | x Je (2.A.11) 


Moreover the variance of the number of initiated mRNAs gives the formula 


Ao1 2(AoA1)? ) 2(AoA1)? 
Ao + À (lo + A) (Xo + A1)4 


Var (Nt) = ( (1 — @ Cota)t) (2.4.12) 

Rigney and Schieve in this article point out how the noise is inherent of the protein production 
and not a mere consequence of the measurements. The mean mRNA can be derived also by 
averaging over a population, but the variance gives important insights of the kinetics of the 
protein production itself. 

This model has lots of simplifications and look just on a specific part of the whole protein 
production. Nevertheless, it shows how the analysis of a simple stochastic model of the protein 
production can supply important information on the kinetics and, consequently, on the biological 
and chemical mechanisms that lie behind gene expression. 


Towards a population model 


In the paper of Rigney 1979 [63] the author analyses few problematics concerning a population 
of growing cells. The starting point of the author is the fact that the 90% of the DNA of 
bacteria growing in a glucose are rarely transcribed. Let consider constitutive genes, i.e. promoter- 
controlled structural genes which are rarely transcribed. The latter characteristic make stochastic 
modeling well-suited for the analysis of the protein production to give an estimation of the 
variability of the number of produced proteins. 

Rigney considers here a population that is able to reach a steady-state, i.e. all the experimental 
conditions and the characteristics of the bacterial population are constants and are such that 
the probability that a cell, selected randomly, shows a property is independent of time. Such a 
population is in the exponential growth phase. 

The population can be divided in sub-populations, of a specific window of ages [a,a + Aa], 
where a is the “age” with respect to cell cycle. This is fundamental when considering the pop- 
ulation perspective, since we may not expect that the cells are synchronous. The aim of the 
Rigney’s population model is to predict distributions of low-level, constitutive proteins. 

The main random process under analysis is the number X(a) of a specific protein in a cell 
in the bacterial population. Since the author deals with a whole asynchronous population of 
bacteria, he makes further simplifications in order to have a treatable model. The probability 
that transcription initiates is supposed to be constant, independently of the cell age, the number 
of polymerases, chromosome conformation and other characteristics of the cell itself. Each mRNA 
is supposed to produce deterministically s proteins during its lifetime, and these s molecules are 
supposed to be produced simultaneously since the inter-initiation time for constitutive proteins 
is much greater of the mRNA lifetime, and of the order of the cell lifetime. In order to consider 
cell growth and division still being able to perform mathematical analysis, Rigney supposes that 
there is one or two copies of the gene depending on cell age. This is an approximation since the 
number of gene copies depends on various factors, in primis on the position of the gene in the 
DNA helix. Moreover, the duplication time 7, of the gene and the time, or age, of cell division 
T2 are supposed to be deterministic, supposing that the proteins in the mother cell to be equally 
probabilistically divided between the daughter cells. 

The protein degradation is supposed to be exponentially distributed with parameter y. 
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The main mathematical tool used here are generating functions. The author starts by study- 
ing the probability P(xo — x) that a cell, having xo molecules of the specific protein at the 
time of its division, has a daughter cell having x molecules at the time of its own division. This 
probability distribution can be used to derive the generating function of t(x,72), which is the 
probability to have x molecules in a cell just before division, under the hypothesis that the sys- 
tem has reached the equilibrium. This last distribution gives finally the generating function for 
a(x, a), which is the probability to have x molecules of the protein in a cell of age a. The author 
concludes the technical computations by showing how derive the moments of the probability 
distribution. 

Rigney proposes then a two-variables extension of its model. In this extension he considers 
two proteins, which have different production characteristics. We point out that he does not 
consider any competition in the production chain of the two-proteins model. The main difficulty 
is that the final distribution will depend on the quantities of the two proteins. 

The author underlines the importance of the hypothesis for the analytic tractability of the 
model and he states 


|...] for models in which the initiation or degradation transition probabilities are 
non-linear functions of the random variables, it will generally be impossible to find 
exact, analytical solutions. 


Despite the simplifications made by Rigney, one of the key point of his population model is that 
he derives the generating function for the probability distribution at population level. 


2.A.2 Paulsson’s model survey 


Johan Paulsson is one of the most prominent researcher in the field of stochastic gene expression 
and this is testified by the number of important works he has published on this argument [55] 
53 [54]. 

We focus on the Paulsson’s work [54] in which the author focuses on the common charac- 
teristics of many stochastic models of gene expression, giving a widespread perspective on the 
field. 


Introduction 


Stochastic models are proposed to take into account the fact that cellular events depends on 
the random collision of molecules. Since many of the cellular events deal with small numbers 
of molecules and are not independent, a probabilistic description is often the best description 
of such processes. A stochastic model is fundamental if we want to investigate and understand 
the process of gene expression, even in cases in which randomness is absent. In fact this lack of 
randomness should be somehow explained in probabilistic terms. 

Most of the stochastic models that can be found in literature focus on genes, mRNAs and 
proteins, including all the other processes in effective transition rates. This fundamental descrip- 
tion may be indefinitely complicated, but in order to better understand the first principles the 
author chooses to focus just three main processes: gene activation/deactivation, transcription 
and translation. 


The model 


Paulsson’s supposes to have a constant number of copies of the specific gene, denoted with n?**, 


each copy switching independently between two states (active/inactive). The cell growth, cell 
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division and gene replication are not included in this model. The author denotes with n1 the 
number of active genes. 

The three main processes modeled, i.e. active genes, mRNAs and proteins, are supposed to 
chase each other. In particular the active genes will affect the number nz of mRNAs, which in 
turn will affect the number n3 of proteins. Messengers and proteins are both degraded. The first 
have short lifetimes, especially in eukaryotes, and are destroyed by the degradative machinery, 
while proteins are more sensitive to the dilution which they undergo because of cell growth. 


Figure 2.7: Three-Stage model. The gene activation/deactivation occur at rate A; and y re- 
spectively. Transcription and translation occur at rates À2 and A3 respectively, while the mRNA 
and protein degradation are supposed to be exponentially distributed with rates 42 and us. 


The Kolmogorov equations for the Markov process (n1(t),n2(t),n3(t)) are then 


dP n EL nm max max 

oP (stats) =A (ni — ny + 1)P(ni — 1, 12,3) — Ant — n1)P(n1, n2, ng) 

+ pan +1)P(n + 1,n2,n3) — ini P(n1, n2, ng) 

+ Agni P(m1,n2 — 1, ng) — Agni P(n1,N2, Ns) (2.4.13) 
+ polna + 1)P(n1,n2 + 1, n3) — poneP(n1, n2, ng) 

+ Asn2P(n,n2,n3 — 1) — Agno P (n1, n2,N3) 


+ [3 (n3 a= 1)P(n1, n2, n3 + 1) — ans P(ra, n2, n3), 


where 


P(ni, n2,n3) = P [n1 (t) = n1,no(t) = n2, n3(t) = n3]. (2.4.14) 


First moments of the modeled processes 


We detail now the techniques used to obtain analytic formulas of the first moments in many 
works found in literature. We may restrain our analysis to the case n}°* = 1, since the pro- 
duction relative to each gene copy is independent. Therefore in the general case, because of this 
independence, then the average and variance of the number of messengers and proteins will result 
multiplied by the constant nP. 
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Since nı € {0,1}, we may restate the Chapman-Kolmogorov equations (2.4.13) by condi- 
tioning to the random variable n1. Define 


Fay) = P [n2(¢) = x, na(t) = y, nı (t) = 0] (2.A.15) 
Ju) = P [no(t) = z, na (t) = y, nı (t) = 1], (2.A.16) 


then the Chapman-Kolmogorov equations for f(z) and g(x,y) read 


d 
Z few) = HiG(a,y) — A fæ) + Hale + 1) fay) — Het (a) HAT fey) 
dt 


— A32 flay) + u3 (Y + 1) fiæ,y+1) — H3Y fley) (2.A.17) 


d 
dpe) = A1 fley) — H19(x,y) ale À29(x—1,y) = À29(x,y) + [a(x + Litas 


— H229(2,y) al À3T (x, y—1) = A3X9(a,y) + ua(y + 1)9(x,y+1) = B3Y9(a,y): (2.A.18) 


where, for typographic issues, we have omitted to explicitly report the dependence on the time 
variable t of f(x,y) and g(a,y)- 
Define the generating functions 


F(z,8) =X 23 fy) (2.A.19) 
vy 
G(z,s)= 5 F5 Oia: (2.4.20) 
vy 
The Kolmogorov equations (2.A.17) and (2.A.17) are finally transformed to the equations 
OF OF OF 
ae mG(z,s) — ALF (z,s) — u2(z — 1) 57 s) + A32(s — 1) 5, s) (2.A.21) 
OF 
- mls- D (2,8) 
ac = ME (z, s) — wiG(z,s) + Az — 1)G(z, s) — pa(z — Ne, s) (2.A.22) 


+ \32(s 1) (2,5) [3 (8 DZ (2,5) 


The aim of this model is to obtain analytic close formula of protein average and variance. 


Therefore we may obtain from equations (2.A.21) and (2.A.22), simplified equations for the 
protein average and variance. Observe that 


OF OG 
By (5) = E[ng,m = 0] Bz C9) = E [ng, n = 1] 
z=s=1 z=s=1 
OF OG 
=~ (2:8) = Efns,n1 = 0] (4) = Efns,n1 = 1], 
ðs zZ=s=1 ðs z=s=1 
where E[-,n1 = 0] is the expected value corresponding to density function (2.4.17) and E[-,n, = 


using density function (2.A.18). Note that from (2.A.22) we have 


= À (1 — G(z,s)) 


— #1G(z, 8) 


“ 


z=s=1 z=s=1 
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since G(1,1) = P[ny = 1] and F(1,1) = P[ny = 0] = 1 
engers and of the proteins 


formulas for the averages of the mess 


: [mRNA] = E 


— P [ni = 1]. We obtain therefore the 


[na] = i [n2, nı = 0] + i |ne, nı = 1] ; (2.A.23) 


l [P] = E [ns] = E[ng, nı = 0] + E[ng, ni = 1]. (2.A.24) 


In analogy to first derivatives formulas, we compute the second derivatives, which will be 


used in the following: 


@F i 0G . 

92 (z,5) = Efn2(n2 — 1),n1 = 0] 92 (z, 8) = E[n2(n2 — 1),n1 = 1] 
Z=s=1 Z=s=1 

O° F i 0G l 

332 (z, s) = E[n3(n3 — 1), nı = 0] 532 (z,s) = E[n3(n3 — 1),n1 = 1] 
Z=s=1 Z=s=1 

@F : 0G 

20s (z, 8) N = Efnons,n1 = 0] 20s (z, 8) Ze = E[ngn3,n, = 1]. 


The average number of messengers is described by the differential equations 


Oa : 
gE rom = 0] = UlË no, ni = 1] 
ô a i 
dt no, ni = 1] = Àl i [n2, n1 = 0] 


while the equations 


g 
at 


E [n3, nı = 1] = À i [na, nı = 0 


9 
ot 


— ME [no, nı = 0] — H2 i [n2, nı = 0] (2.A.25) 


— Hı i [ne, n= 1] + A2G(1, 1) — H2 E no, nı = 1] 4 (2.4.26) 


E [n3, 71 = 0] = Wi ns, n1 = 1] — ME [n3, nı = 0] 


| — 4E[n3,m1 = 1] 


give the average of protein copy number. 
To obtain the formula of variance, we have to write down the differential equations for the 
second moments, which for the messengers read 


o 
Ot 
— 22 
ô 
HE [na(n2 — 1),n1 = 1] = À 


+ 2A2E 


z na(n2 = 1), nı = 0 


D [na(n — 1), nı = 0] 


[no, nı = 1] mu 22 


+ A3E [n2,n1 = 0] — p3E [n3, nı = 0] 
(2.4.27) 


+ A3E [no, ni = 1] — u3Ë[ns,n1 = 1], 


(2.A.28) 


LÉ [na(n a 1), nı = 0] = +p E [no (na — 1), nı = 1] — Ài E [na(n 1), ni = 0] (2.A.29) 


— Wh na(n = 1), nı = 1] (2.A.30) 


ù [na (n2 = 1), nı = 1] . 


In order to obtain a close system of differential equations for the second moments of the number 
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of proteins, we need also the differential equations for the cross-moments: 


2 
ðt 


| © 


l [na (na — 1), nı = 0] = + 


i [ng (n3 = 1), nı = 1] 


+ 2A3E 


i [ng (n3 — 1), nı = 1] = À 
+2À3 


i [nens, nı = 1] = 2u3 


+ 


[nons, ny = 0] — 23 


E [ng (n3 = 1), nı =0 


i [na(n — 1), nı = 0] — 


mE [na(n — 1), nı = 1] 


E [ng (n3 = 1), nı =1 


l [ng (n3 = 1), nı _ 0] 


= +u 
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(2.A.31) 


(2.A.32) 


i [nens, nı = 0] — H2 E non, nı = 0] 


D) [nens, ny = 0] 


— 43 


i [nens, nı = 1] 


— u3 i [nang, nı = 1] + X 
+ À3 i [na (n2 — 1), nı = T 


= ME [nong, nı = 0] — m 


ù [nang, nı = 1] = À 


1 (n2n3,N, = 0] + X3E N2, N1 = 0] + À3 


i [nons, nı = 1] — H2 


E [ng, nı = 1] + À3 


i |ne, nı = 1] 


(2.4.33) 


i [na (n2 = 1), nı = 0] 


l [n2n3,n1 = 1] 


(2.4.34) 


The previous systems of differential equations give the first moments of messengers and 
proteins. In the following Section we derive the solution of these systems at equilibrium. 


Analytic close formulas for first moments 


We can obtain the mean and variance of the number of messengers and proteins at equilibrium, 
by solving the equilibrium differential equations associated to the differential equations presented 
in the previous section. It results that 


: Any 
nm — 
[nı] AETA 
Aypanex 
Var (nı) = | 
(m) (A1 +1)? 
: in A àz Apnmax 
2 fog Ay + Ha 
max 2 max 
A T u, X Ses 
ue À +y pluz + Ai + pr) (M + p) 


(2.4.35) 
(2.4.36) 
(2.4.37) 


(2.4.38) 


where we have reintroduced the constant n°°*, which accounts for possible copies of the specific 


This reasoning can be extended to the three-stage model, which lead also to the protein 


2 
À A3 À npex 


gene. 
statistics 
A3 À Ange 
[ng] =, 
H3 H2 Ài + Hı 
À À À max 
Var (n3) = RU 
H3 H2 Ai + Hı 


+ 
Hops (Ai + u1) (u2 + ps) 


uo3lu2 + H3 + Aq + p] 


62) Anny 
(Ar +)? (u2 + u3)[ue 


M23 


+ Ar + pa][es + À + pa] 


(2.4.39) 


(2.4.40) 


(2.4.41) 
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Paulsson gives analytical formulas for both averages and variances of all the three processes 
that are modeled and gives possible interpretations of the obtained results. Analyzing the re- 
sulting variance, he interprets different addends as noises inherent to the studied process itself 
or deriving from fluctuations in upstream processes in the gene expression. 

In a Paulsson’s paper of 2004 [53] the author present a possible interpretation of the linear 
noise approximation for gene expression. This concept is restated and explained in [54], where 
the author shows possible applications of this approximation in order to compute statistics of 
the three-stage model with non linear transition rates, where the only constraint is that each 
stochastic event add or remove one molecule at a time. Here the three-stage model as represented 
in figure [2.7] is a special case, since in this case the linear approximation results to be exact. 


2.A.3 Swain’s model 


Peter Swain, Michael Elowitz, Eric Siggia, Vahid Shahrezaei, Mukund Thattai and Alexander 
van Oudenaarden have produced many papers on the gene expression, trying to analyze different 
aspects of protein production using tools of the theory of probability. 

We left aside for the moment the problem of “intrinsic” and “extrinsic” noise, analyzed among 
others by Swain and Elowitz [69] and by Thattai and van Oudenaarden [71]. We will focus on 
the so-called “intrinsic noise”, which is the noise of the quantity of proteins related inherently to 
the production chain, as defined in the work [69]. 


Introduction 


The authors consider here the stochasticity which is connected to the protein production: random 
timing of chemical reactions and stochastic encounter of cellular compounds. 

Typically the experimental data are compared with models which predict the mean and 
possibly the variance of the proteins, since the analytic formula of the probabilistic distribution 
of the proteins is often hard to derive. 

Using few biological data, as the fact that mRNA degradation is faster than protein decay, 
the authors propose an approximate method for solving the master equation, giving the transient 
and stationary approximated probabilistic distributions of the protein number. 


The model 


Following the main models in the literature, Swain and Shahrezaei consider the three-stage model 
as the fundamental model to describe gene expression, see Section[2. A .2]for details. The promoter 
of the gene, corresponding to the specific protein of interest, can be in one of two states: active or 
inactive. These transitions may occur because of the binding/unbinding of specific proteins or by 
pausing polymerase. This is also one of the basic mechanisms considered by Rigney in his work. 
The other two stages considered are the messengers and protein dynamics. The transcription of 
a new mRNA may occur only if the promoter is active. Transcription, translation and the decay 
of proteins and messengers are modeled as first-order chemical reactions, i.e. they are supposed 
to be exponentially distributed. 

The fundamental hypothesis of this model is that the messenger lifetime is much shorter than 
protein lifetime. The protein fluctuations are so determined only by time-averaged properties of 
the mRNA and not by finer time-dependent ones. This is the key point of Swain approach to 
simplify the mathematical description in order to give an analytic approximated formula for the 
probabilistic distribution. 
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For the two-stage model the probability Pm,n(t) of having m mRNAs and n proteins at time 
t satisfies the master equation 


OPnin 
ot 


= Vo(Pm-—1.n = Pri) NE vım(Pmn-1 Ern) H do (m H 1)Pm+1,n = mPm,n] 


(2.4.42) 
+ dy [(n T 1)Pm,n+1 = nPm,n] ; 


where vp and vı are the transcription and translation rates respectively, while do and dı are 
the degradation rates for the messengers and the proteins respectively. Defining the generating 
function 
F(z,2) = 5 ale a) ahi 

mn 
and using equation (2.A.42), we find a first-order partial differential equation for the generating 
function JF AP 15F 

u u 

< — 7 |b +u) - =] 4 ut (2.4.43) 
OV ðu  v OT v 
where a = vo/dı, b = vı /do, y = do/dı and where the new variables are 7 = dıt, u = 2’ — 1 and 
v=z-l. 
Solving the previous equation and for y >> 1, the generating function is given by 


V 


(2.A.44) 
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Fan | 
if we assume there are no proteins at T = 0. So when y >> 1 and T >> 771, then 


_ T(atn) b \” (1+ be-7 \* -n —a 140 
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ed ) is a hypergeometric function. 


where T is the gamma function and 2F ee. D 


Note 2.A.1. We recall that the hypergeometric function is defined as 


a b. = < (a)n (b)n 2” 
2 ( a ) 2 T (2.4.46) 


provided that c#0,—1,—-2,.... The (x), notation stand for 


G= 1 ifn = 0, 
sn x(æ+1)...(x+n-1) ifn>0. 


The hypergeometric function may be characterized via a differential equation, the Euler’s hyper- 
geometric differential equation 


d? d 
(1-27 4 fem (a+b+ 1a -abw =0 (2.A.A7) 
At steady state the previous equation gives 
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which is a negative binomial distribution. Observe that b is the mean number of proteins produced 
during a mRNA lifetime, this quantity is referred in the literature as “bursts”. Protein lifetimes 
are usually greater than mRNA lifetimes, so, typically, no protein will be degraded during a 
messenger lifetime. Moreover if we look at the dynamics in the protein time-scale, the proteins 
produced by a chosen mRNA will appear to be synthesized concurrently, i.e. in a burst. 

The same analysis can be done in the three-stage model, with a greater effort in order to solve 
the related differential problem and to derive the approximated formulas. 

Supposing a difference in the mRNA and protein lifetimes, i.e. y >> 1, has allowed simplifica- 
tions, leading to a complete approximated description of the transient and stationary probability 
distribution of the protein numbers. This is due mainly to the fact that number of messengers 
reaches rapidly steady state, reducing protein fluctuations. When y has large values, then the 
proteins produced by a mRNA are produced in a burst, which is geometrically distributed. Nev- 
ertheless when y < 1 then protein degradation will occur during the translation of new proteins, 
which makes unlikely to have a geometric distribution for the protein production. 


Chapter 3 


Realistic model of gene expression 


The three-stage model proposed in Chapter |2| introduces a new description of gene expression 
using marked Poisson point processes (MPPP). However, if the model presented is a general case 
of the classic models, the choices of distributions made have no relevant biological meaning and 
serve as illustration of the mathematical framework. In this chapter, MPPP framework is the 
fundamental ingredient to obtain a more realistic and appropriate description of gene expression. 
The impact of such more realistic description will be analyzed together with possible relevant 
choices of probability distributions, in order to bring light on the existing modeling and provide 
new insights on gene expression and its characterization. 

Mathematical models play a crucial role in gene expression and evolve continuously in order 
to improve their qualitative and quantitative descriptive capabilities. However, a model is the 
result of the trade-off between a fine and accurate description of the biological phenomena and 
the limits imposed by the theoretical tools to characterize and analyze the properties of the model 
itself. Such a dilemma is even more manifest when explicit analytic formulas are expected. 

The classic Markovian description of gene expression [5] [54 [64] captures the main character- 
istics of the gene expression, but comes with the strong exponential assumption for the duration 
of every step. In particular, this assumption is not justified for the description of steps such 
as the protein elongation and the protein decay due to volume dilution, as discussed in Section 
[1.4.1] The critical analysis of the Markovian description of gene expression before the biological 
knowledge requires the introduction of an appropriate and more general mathematical frame- 
work. In order to include the protein elongation process, we introduce a supplementary step with 
respect to the model introduced in Chapter[2] by splitting the effective step of translation into the 
ribosome binding on a messenger and the subsequent elongation of the polypeptide chain. The 
MPPP framework is then used as the fundamental ingredient to derive a more realistic four-stage 
model of gene expression which includes protein elongation. 

The protein dilution is also considered by adopting an appropriate description of the phe- 
nomenon, obtained by coupling the stochastic description with a deterministic continuous model 
of dilution. The resulting model mixes a stochastic description of the dynamics of different com- 
ponents involved in gene expression with a deterministic description of dilution and is referred 
to as hybrid four-stage model. 

General results valid for any choice of distributions of various steps are thus derived for both 
four-stage and hybrid four-stage models. 


Plan of the Chapter. In Section we describe the main steps of a model of gene expression. 
Unlike the three-stage model introduced in Chapter [2] we add now a supplementary fourth step 
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to account for a more realistic description of protein elongation step. The obtained model is 
referred to as four-stage model and we derive the statistics of the main processes under general 
assumptions of the duration of several steps. 

The general model detailed in Section[3.1.1]is specialized with respect to the biological knowl- 
edge acquired on the underlying processes. In particular, Section [3.1.2] is devoted to a detailed 
discussion on the appropriate assumptions for the general steps, focusing on the description of 
protein elongation step and protein dilution, which are peculiarities of our model with respect 
to what has been proposed so far. 

Section [3.1.3]is devoted to an in-depth analysis of the four-stage model with realistic assump- 
tions. Because of the complexity of the analytic formulas of a realistic protein elongation, we 
consider in Section 3.2.2]an appropriate approximation of the duration of elongation step, which 
is shown to be well approximated by a deterministic duration. Given such approximation, a 
comparison with a classical approach with all exponential steps lead to a counter-intuitive result, 
detailed in Section [3.2.3 

The accurate description of protein dilution lead to precise quantification of this phenomenon 
and is a major improvement with respect to the classic approach. In particular, the introduction 
of such realistic description lead to a substantial difference in the resulting formulas: despite 
the classic formulas capture the qualitative trend of fluctuations, they overestimate fluctuations 
when dilution is the main decay mechanism. Moreover, the realistic description of protein dilution 
invalidates a result of the classic approach, i.e. the process describing the number of proteins 
at equilibrium has a wider distribution than a Poisson process. In fact, using a model with a 
realistic description of protein dilution we show that this “Poisson limit” can be theoretically 
trespassed. 

Section [3.2.5] is devoted to the analysis of the impact of the different steps on the resulting 
fluctuations in the four-stage model with realistic assumptions. It is shown that gene activity and 
protein dilution have the most important effect on fluctuations, while the realistic description 
of protein elongation have a mild impact, except for specific cases such as unstable proteins. 
Moreover, is studied the impact of rare codons, whose role is still an open question, on elongation 
step and, therefore, on protein fluctuations. 


3.1 Four-Stage Model of Gene Expression 


Protein elongation is usually included as an effective exponential step in classic models. However, 
this step has proven to be a crucial, and in some cases limiting, step in the production of a protein. 
For this reason, in order to account more precisely of this process, we introduce now a Four-Stage 
Model of gene expression, where the supplementary step corresponds to the protein elongation 
step. 


3.1.1 Model and general results 


We introduce now the Four-Stage model using the mathematical toolkit introduced in Chapter [2| 
The results obtained in this section rely on general assumptions of the duration of several steps 
in the model. The Four-Stage Model is described in Figure[3.2| where appears the elongation at 
step 5. 


Notation 3.1.1. As for Chapter[2 if not specified differently, a process is said to occur at some 
rate A, means that the duration of such process is exponentially distributed with parameter A. 
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Gene activation. 


The gene regulation is modeled as a two-stage gene activation: the gene is inactivated at rate À} 
and is activated by the unbinding of the repressor at rate Aj, see [57] [39] [36] BA [67] for details. 
The assumption that the maximum number of active genes Nmax is 1 is not restrictive, since the 
statistics are proportional to Nmax by independence and the formulas in the general case can be 
easily obtained. 

The stationary process (Y(t),¢ € R) with values in {0,1}, represents the gene status, where 
Y(t) = 1 indicates that the gene is active at time t, while Y(t) = 0 if it is inactive. The behaviour 
of the process (Y (t)) is well known, see [T8] for details, and at equilibrium it results 


P(Y =1)=6, = 1 — P(Y =0), (3.1.1) 


where 56, = à] /A and A = àf + Àj. In order to compute the variance of the other processes, 
we need the quantity 
P(Y (t) = 1]¥ (0) = 1) = 4 (1 — ô4)e7 ®t. (3.1.2) 


At equilibrium, the average of gene activity is given by 
E [Y] = ô, and the variance by var(Y) = 4 (1 — 64). 


mRNA dynamics. 
Active gene 
The active gene produces mRNAs at rate À2 and we 


denote with F2(dy) the probability distribution of the 
lifetime of a messenger, i.e. P [o2 € A] = f, Fo(dy). The LA LA LL LL 
dynamics of the messengers can be described using the 
notations and tools of MPPP theory, as shown in Chap- 
ter [2] In particular, let Ny, = (8n,02,n) be a MPPP on 
R x R, with intensity measure À du @ F2(dv), where 
Sn and 02, represent the instants of messenger creation 
and the lifetime of a mRNA born at s,, respectively. 

The point process M representing the instants of 
creation of mRNAs and the associated lifetime can be 
represented as 


Inactive gene 


M = Liy u=} lu v) NA (du, dv), (3.1.3) 
RxR 
+ Figure 3.1: Gene activation. The gene 


where 6, is the Dirac mass at z. activation/deactivation occur at rate 
The number of messengers alive at equilibrium can Af and À; respectively. 
be represented as 


M = Liu<o<ut+v, Y (u)=1} M2 (du, dv). (3.1.4) 
RXR+ 


Remark. Here the mRNA is available for translation once a small portion of the growing mRNA 
chain has been assembled. This assumption is coherent with the prokaryotic dynamics, but 
should be adapted for the eukaryotic case. In fact in this case we have to wait the completed 
messenger to be exported to the cytoplasm. If we assume this duration to be deterministic, then 
the previously defined integral should be shifted of a constant value and we should easily get the 
corresponding analytic results. 
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Using the MPPP representation of the number of messengers, we can now get an explicit 
expression of the average and variance of messengers at equilibrium. 


Proposition 3.1.2. If the distribution of the messenger lifetime o2 is F2(dx), the average num- 
ber M of mRNAs at equilibrium is given by 


(M) = 54 E(o>). (3.1.5) 


The variance of M is given by 


var(M) =E(M) + 21564 (1 — ô+) 


T — (3.1.6) 
. 7 e" Fa(u)F2(u + v) du dv, 
0 
where we use the notations F(x) = P2([0,x]) and F(x) = = (1 — Fo(a)). 
See Chapter [2] section or the paper [18] for a complete proof of these results. 


Remark. The presented results are still valid in the case in which some random variable have 
not a probability density function. In such a context, the integrals should then be interpreted in 
the distribution sense. 


> fo. 


Yor a3 ~ F3(dt) 


m $ ~ Fo(dt) 
on \ de 2( o4 ~ Fi, (dt) 


P EST Lors 
0e aby 


Figure 3.2: Four-stage model. Stage 1 - Gene activation (steps 1 — 2). The initiation of 
transcription is strongly regulated; schematically the gene is said to be in “inactive state” if a 
repressor prevents the polymerase binding and is in “active state” otherwise. The gene inac- 
tivation (respectively activation) occurs at rate AJ (respectively Aj). Stage 2 - Transcription 
(step 3). When the gene is in active state, the RNA polymerase produces an mRNA at rate 
Az. The probability distribution of an mRNA lifetime is F2(dt) and is general (step [4]). Stage 
3 - Translation (steps 5 — 6). A ribosome binds to an active mRNA at rate À3 and proceeds 
to protein elongation, whose duration a3 is assumed to have a general probability distribution 
F3(dt). Stage 4 - Protein decay (step 7). Protein decay is achieved here by proteolysis, the case of 
protein dilution being treated in Section [3.1.2] The distribution F,(dt) of the protein lifetime o4 
is supposed general for the moment. The parameters X Ay, À, Ag, indicate that the duration 
of the relative step has exponential distribution of parameter À. 


3.1. FOUR-STAGE MODEL 55 


Active ribosomes dynamics. 


We consider the dynamics of ribosomes, i.e. the binding ribosome-mRNA, which occurs at rate 
A3, and the polypeptide elongation step, whose elongation time o3 has a general probability 
distribution F3(dt). Note that multiple ribosomes can be active on the same messenger at a 
time, resulting in a queue of ribosomes elongating proteins. These ribosomes are referred to 
as active ribosomes. The number of proteins results therefore amplified with respect to the 
population of mRNAs. 

The point process R representing the instants of ribosome binding on mRNA and the asso- 
ciated protein elongation time can be written as 


R= M(du, dv) 1 Sta y NÝ (dz, dy), (3.1.7) 
RxR+ [u,u+v]xR+ 
where, for u € R, M x, (dx, dy) is a marked Poisson point process on R x R+ with intensity 
measure A3 dr ® F3(dy), created at time u. 
The number of active ribosomes at equilibrium can be represented by the following random 
variable 


R= JJ Liy (w=), (du, dv) (U IieciceiyucecuiaeN (dz, a) 
RxR+ RXR+ 


= f| Lir<o<r+y} R(dz, dy) 
RXR: 


where similar arguments than in Chapter 22] apply. 
The statistics of the active ribosomes at equilibrium can be obtained using the approach 
showed in Chapter [2] For this reason we skip the proofs of the following theorem. 


(3.1.8) 


Proposition 3.1.3. If the distribution of the lifetime of an mRNA is F;(dx) and the distribution 
of elongation time is F3(dy), then the expected value of the random variable R, which is the 
number of ribosomes active on mRNAs at equilibrium, is given by 


i(R) = b4A2A3 E(o2) i(o3) (3.1.9) 


and its variance var(R) can be expressed as 


+00 (—s+t)A0 __ 
var(R) = E(R) + XX264 J | / F 
0 Ry =s 


2 
+ A2A38 (1 = ô+) f er ^l(u1—u2)+(v1—v2)| II Fo (u;)F3 (vi) du; du;, 


R4 i=1 


dsF(dt) (3.1.10) 


3(u) du 


where, for i = 2, 3, Fi(x) = (1 — F;(x)) and x Ay = min{x, y}. 


Actually, we can obtain a more general result than in Proposition [3.1.3] allowing to recover 
any statistic of a function of the point process describing ribosomes. In fact, equation 
allows to consider functions depending on the specific point process. More in detail, for any 
function f : R x R+ > R integrable with respect to the point process R(dx, dy) we can consider 
R(f), which is a shorthand for 


R(f) = M (du, dv) J fle, y) NË, (dz, dy). (3.1.11) 


RxR4+ [u,u+v]xR+ 
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We now derive a theorem analogous of Theorem in which we obtain the statistics of R(f) 
at equilibrium. 


Theorem 3.1.4. Let f: Rx R} > R a non negative Borelian function, then if the distribution 
of the lifetime of an mRNA is F2(dx) and the distribution of elongation time is F3(dy), the 
expected value of the random variable R(f) is given by 


B(R(f)) = Be (on) [rude Fa(dy) (3.1.12) 
RXR+ 
and its variance var(R(f)) can be expressed as 


on | uv) duPy(av) 


2 
+ 2164 | (/ Lrscucere (0,4) dra) dsF (dt) (3.1.13) 
RxRy RXR+ 


+ X2X564(1 — 64) i eAlu-wtz—al f(u, v) f(x, y)F2(w) F2(z) du dx dw dzF3(dv) F3(dy), 


R?xR4 
(3.1.14) 
where, for i = 2,3, F;(x) = (1 — Fi(x)). 


Proof. The proof consists in a slight modification of the proof of Proposition [2.3.5] Recall that 
R(f) can also be represented as 


=f Lypy(s,,)= ylisn<a<sn +o n} f (2, YNA (dx, dy). 


nez XR+ 


Denote by E the conditional expectation E[- | (Y(t)), (Sn, O2n)]. The conditional generating 


function E [e772] can be written as 


E (II exp (sto, -» | Lfs,<2<snton, Je yN: ' (dx, a))) 
RxR+ 


nez 
= II E i (aveu f Lis, <e<sn+oz n} f MUNG (dx, w)|; 
RXR+ 


since the point processes Ny”, n € Z, are independent. The nth term of this product is, applying 
Proposition [A.0.5]to the aa Poisson point processes Ny” de? 


= 


E |exp (--/ g(x, y) NSS nay) = exp (-x / (1 = een) de Fala) , 
RxR, 


where g(x,y) = lty (sn)=1} {sn <2<sn+02,n} f (2, Y). 
Now, 


E Le |) = s fef- NE. . Wie eS E (1—e-few)) liy (s,)=1} a daraa) 
(v)] = exp {vf (1 =g Cle; aut ne 


-E jee (G(s,t)) 


(3.1.15) 
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where 


G(s,t) = x | Ly s<e<ste}lyy(s)=1} (1 - grea dzF;(dy). 
RxR+ 


In order to obtain an expression for E(R(f)) and E(R(f)(R(f) — 1)|(Y (t))), we have to take the 
first and second derivative of equation with respect to z and evaluate it at z = 0. The 
resulting formula should then be integrated with respect to (Y(t)) to obtain formulas 
and (3.1.13), by using similar arguments as in the proof of Proposition [2.3.3] (with more technical 
calculations). 


Proposition [3.1.2] as well as Theorem [3.1.5] could be generalized by proceeding to a similar 
proof as done here. We just give the general result in the case of active ribosomes, since, as we 
will see later on, we will use this result to derive formulas of mean and variance when we consider 
dilution as the main decay mechanism for proteins during exponential growth phase. 


Proteins dynamics. 


In a similar way as done for ribosomes, we can describe the protein dynamics in connection with 
the dynamics of messengers and active ribosomes. The point process P representing the instants 
of translation initiation, duration of protein elongation and protein lifetime can be represented 
as 


p= Î Mana | | õa y, Ni (der, dy, d2), (3.1.16) 
RXR+ [u,u+v]xR+ 


where, for u € R, Ny, (dx, dy, dz) is a marked Poisson point process on Rx Ry xR, with intensity 
measure A3 dr ® F3(dy) ® Fa(dz), created at time u. Here F3 is the distribution associated with 
protein elongation time and F; is the distribution associated to protein decay. 

The number of proteins at equilibrium can be represented by the following random variable 


P =i] LY (u)=1 Na, (du, dv) (I lip males sn (arty )) 
RXR+ RXR+ 
=J Liz+y<o<s+y+2}P (dz, dy, dz) (3.1.17) 
RXR+ 


where the indicator function L{3<0<r+y} is used to represent all the proteins which have started 
translation at a negative time and which have not yet finished their task at time t = 0. 

The statistics of the proteins at equilibrium can be obtained using the approach showed in 
Proposition |3.1.2|and [3.1.3] We now give the results concerning protein, but we skip proofs since 
they use similar arguments as in proof of Propositions [3.1.2] and [3.1.3] but with more technical 
computations. 


Theorem 3.1.5. If the distribution of the lifetime of an mRNA is F>(dx), the distribution of 
elongation time is F3(dy) and the distribution of the lifetime of a protein is Fa(dz), then the 
expected value of the random variable P, which is the number of proteins at equilibrium, is given 


by 


E(P) = 64A2A3E(o2)E(o4) (3.1.18) 
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and its variance var(P) can be expressed as 


var(P) = E(P) + 215 | 


RXR+ 


L{s<o} iL. lé (utstt)<y<—(uts)}f4(u) duF3(dy)| dsF2(dt) 
+ 


(3.1.19) 


+ 3384 (1 — 5) f EME EU al F'o(z) F'o(¢)Fa(a)Fa(é) dx dz dé d¢F3 (dy) F3(dn), 


RE 
where, for i = 2, 4, Fj(x) = F;([0,x]) and F(x) = (1 — F;(x)). 


The formulas obtained in this section involve general distributions for several steps of the 
protein production. However, the averages of the modeled processes at equilibrium depend only 
on the average duration of the general steps. This means that, for given conditions and given 
mean duration of the different steps, the choice of specific distributions does not affect the 
average number of messengers, active ribosomes and proteins at equilibrium. On the other side, 
the choice of specific distributions has an impact on the fluctuations of the different processes. 


3.1.2 Realistic assumptions 


In this Section we discuss the appropriate assumptions of the different processes described in 
the four-stage model. The general formulas presented in Section [3.1.1] are now specialized, by 
considering appropriate choices for the distributions in order to better describe the underlying 
process and obtain a more realistic model. 

As usual, we refer to exponential assumption when the time to produce a particular cellular 
component or its lifetime is exponentially distributed. We have already discussed and described 
the appropriateness of such assumption in Section 2.1.3) which is well suited to describe the 
biochemical reactions which requires primarily the encounter of two reactants. We first recall 
the main steps of the four-stage model with exponentially distributed duration. 


Gene activation. The gene activation/inactivation is mainly driven by repressor, which is a 
DNA-binding protein that regulates the expression of the specific protein by binding the oper- 
ator thus preventing polymerase binding. For this reason the activation step is modeled as a 
two state process, where the switching between states occur in exponentially distributed time. 
Experimental results of Goldin et al.[22] show that the exponential two-step model is a well 
suited description of the biological phenomenon. 


Transcription and translation initiation. Transcription and translation processes have sim- 
ilarities in their main steps. In particular, during transcription initiation a polymerase binds on 
the gene, followed by specific processes before the start of messenger elongation. Analogously, 
during translation initiation a ribosome attaches on a messenger and then a series of processes 
occur in order to well accommodate the ribosome and proceeds to the next step. We assume 
that the initiations of both transcription and translation to be exponentially distributed, since 
the main limiting mechanism is the successful binding of polymerase and ribosome. 


mRNA degradation. The degradation of messengers occurs by means of RNase, which locates 
and binds to the target messenger in order to start the degradation process. The degradation step 
is therefore well described to occur in exponentially distributed time, since, once a RNase has 
bound on the messenger, the translation initiation is not possible anymore (the RNase prevents 
the binding of ribosome and has eventually degraded the starting sequence). Note that there is 
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a competition between translation initiation and mRNA decay. In particular, the RNase and 
ribosome compete to bind to the ribosome binding site (RBS) in first place. If RNase succeeds, 
then it degrades the mRNA in the 5’ to 3’ direction and does not interact with ribosomes bound 
to mRNA, which can complete elongation of the specific protein, see also [37] [74]. 


mRNA elongation. Messenger elongation, despite shares common characteristics with protein 
elongation, has not been intentionally considered in the four-stage model. In fact, in the presented 
model, once the gene is active, the messenger is available for translation initiation up to an 
exponentially distributed time. We show now that this model assumption is a licit, since our 
analysis is restricted to prokaryotic gene expression and the impact of the messenger elongation 
step is here negligible. 

The mRNA elongation can be described in a similar way of protein elongation and is possible 
to derive general formulas of the variance of messengers at equilibrium in such more realistic 
case. In particular, we assume that the gene switches between active and inactive states at rates 
A; and At respectively. The polymerase binds on the active gene at rate À2 and the duration 
C2 of the subsequent messenger elongation has a general distribution G2(dt), as shown in Figure 
[3.3] The messenger lifetime ¢3 is then assumed to have distribution G'3 (dt). 


b 2 £ 
aT 9 


6 ~ G2(dz) 


Figure 3.3: mRNA elongation model. The gene inactivation (respectively activation) occurs 
at rate A; (respectively a) When the gene is in active state, the RNA polymerase attaches 
the gene at rate À2 and proceeds to mRNA elongation, whose duration @ is assumed to have a 
general probability distribution G2(dt). Messenger decay is achieved RNase and the distribution 
G3(dt) of mRNA lifetime ¢3 is supposed general for the moment. The parameters AŤ, AJ, Az 
and À3 indicate that the duration of the relative step has exponential distribution of parameter 


À. 


In order to obtain an estimation of the impact of messenger elongation step on fluctuations, 
we consider two limit cases: the case of large fluctuations on mRNA elongation, i.e. exponential 
elongation, and the case with no fluctuations, i.e. deterministic elongation. The messenger decay 
is supposed from here on to be exponentially distributed with parameter p13. 

If the messenger elongation is exponentially distributed, then the mRNA fluctuations are 
described by the formula 


Ali À + po + u3 
A + us (A + u2)(uo + u3) 


var ®) (M) = E[M] |1 + (1-464) 


(3.1.20) 
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where the superscript “E” serves as a reminder of the exponential choice of the elongation time 
distribution. On the other side, if the elongation time is deterministic and its duration is denoted 
by 72, then the variance of the number of mRNAs at equilibrium is given by 


A2 
A+ us |” 


var D) (M) =E[M]|1+(1-6,) 


(3.1.21) 


where now u2 = 1/72 and “D” stands for deterministic elongation. 

We can show that var (®) (M) < var)(M) for any choice of parameters. Moreover, simula- 
tions indicate that the variance corresponding to other suitable distribution choices are comprises 
in between the two previous formulas. For this reason we use those formulas to give an estimation 
of the impact of the modeling of messenger elongation on its resulting fluctuations. 

There is biological evidence, see also [43] and [74], that in prokaryotes ribosomes can bind to 
the elongating mRNA. In fact, since prokaryotes have no nucleus, the messenger is accessible yet 
during elongation and a ribosome can bind to it as long as the RBS sequence has been assembled. 
For this reason, the time the messenger becomes available for translation corresponds to the 
time required to produce this starting sequence, which varies from 0.6 up to 4 seconds under 
biologically relevant ranges of parameters. 

If we consider a constitutive gene, i.e. 64 = 1, then formulas and are identical 
and the elongation distribution has no impact on fluctuations. If 6, # 1, the gene dynamics 
have the strongest impact on messenger fluctuations. Analyzing the variances for different rates 
of activation/repression, with elongation time of the starting sequence in the window (0.5, 10] 
seconds and with an average lifetime of 1 — 2 minutes, the differences of the relative variances of 
the deterministic and exponential elongation is smaller than ~ 3%. Messenger elongation has a 
weak impact on fluctuations and the simplification made in the four-stage model is reasonable. In 
conclusion, since we are interested in messenger dynamics just in perspective of the production of 
proteins, the exponential assumption for the effective step of messenger transcription. Note that 
in the four-stage model we make no distinction between elongating and completed messengers. 


Remark. The simplification made in four-stage model to not consider messenger elongation is 
no more valid in the eukaryotic case since, because of the presence of a nucleus, transcription 
and translation occur in separated compartments. However, the MPPP framework can be used 
to build a model considering the specificity of eukaryotic gene expression. In particular, the 
separation of transcription and translation requires the elongated messenger to be firstly exported 
to the cytoplasm in order to undergo subsequent processing. The time to completely elongate the 
messenger as well as the time required to export it to the cytoplasm must then be included into 
the description, resulting in a slightly more complex, but still treatable, model of gene expression. 


Protein elongation. The protein elongation results in an iterative procedure in which each 
codon of the messenger chain is coupled with a particular tRNA, which adds a new amino-acid 
to the growing polypeptide chain by means of ribosome, see Figure The insertion of a 
new amino-acid requires firstly the encounter of a charged tRNA, corresponding to the codon 
the ribosome is reading, with the ribosome. The amino acid is then attached to the growing 
polypeptide chain, while the uncharged tRNA is released in the cytoplasm and the ribosome 
moves to the next codon. Since the encounter of tRNA with the ribosome proves to be costly 
in terms of time, then each event of attachment of a new amino acid can be described to occur 
up to an exponentially distributed random time. If the specific protein is composed of N amino 
acids, then the duration TO of the elongation step is the sum of N exponentially distributed 
random variables. 

We can identify K classes of codons, each class including codons with the same characteristics, 
i.e. each codon that belongs to a class k is assembled in an exponential time of parameter pp. Let 


3.1. FOUR-STAGE MODEL 61 


Tp be the time to process the Nx codons of class k, with k = 1,..., K, hence 7% is the sum of Nk 
independent exponential random variables of parameter pp. Therefore, Tk is Erlang distributed, 
Tr ~ Erl(Nx, px), where pp is referred to as rate and Nọ as shape parameter, and its probability 
density function is 


(N-D? 


where Ny is such that aan Nk = N. Consequently, the duration of the protein elongation 


TON = sn Tk can be described as the sum of independent random variables with Erlang 
distribution. 


tRNA 
(uncharged) 


ÉD tRNA 


(charged) 


(a) Ribosome binding. (b) Chain elongation. 


Figure 3.4: Protein elongation. Translation starts with the binding of the ribosome on the 
messenger (see |(a)), which assembles the polypeptide chain. Each of the four amino acids is 
carried by a specific charged tRNA (see figure [(b)}. Once the tRNA has bound to the ribosome, 
the amino acid is attached to the growing chain and the uncharged tRNA is released in the 
cytoplasm. 


Protein decay: proteolysis and volume dilution. There are two distinct mechanisms of 
proteinproteolysis and volume dilution. 

Protein decay through proteolysis is a controlled process and is mainly associated to a quality 
control of the produced proteins or to particular situations such as stress. Temperature and stress 
have a deleterious effect on proteins which are eventually denatured. Proteolysis intervenes at this 
point as maintenance and destroys possibly the damaged proteins. The degradation of protein 
via proteolysis shares a similar description of mRNA decay: a protease locates and binds to a 
protein and breaks down the chain into smaller polypeptides or amino acids. For this reason, 
the exponential distribution is the most appropriate choice for the distribution F4(dz) in the 
four-stage model. In general, proteolysis is not the main protein decay process within a cell in 
normal conditions and the typical protein lifetime is of several cell cycles. 

The protein decay via dilution is a process is a completely different process. In general, 
prokaryotic and eukaryotic cells increase their internal components in order to give rise to two 
daughter cells. The volume increase associated to cell growth is the responsible of the dilution of 
the cellular components. Bacterial protein production occurs mainly in the exponential growth 
phase, where dilution is the main cause of protein decay. The phenomenon of dilution can 
be described by a continuous deterministic process, since it is the direct consequence of the 
(continuous) increase of cell volume and the speed of dilution corresponds to the cell growth rate, 
which is a well determined parameter for fixed environmental conditions. Consider a (fixed) unit 
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of volume Vo, which contains one unity of protein production, i.e. it contains a fixed number 
of genes, the corresponding produced mRNAs and their degradation machinery and a constant 
pull of free ribosomes. We assume that cell dilution affects only the proteins produced in Vo, in 
particular this can be visualized in the following way: the production machinery is confined to 
the volume Vo, except the proteins which can continuously leak out the fixed volume, see figure 
If P(t) denote the number of proteins at time t in the volume Vo, we define the protein 
concentration P.(t) as 


Pa) © (3.1.22) 


where P(t) is the protein copy number at time t. 

Let {li },—1. oo be the sequence of the in- 
stants of birth of new proteins and denote with 
to the first time of observations. By definition of 
the sequence {li },—1 401 We have 


P(Iz) mo (3.1.23) 


k 
P(t) = PG) +1 = P.(i,)Vo + 1. 


During exponential phase of growth, the decay 
protein of protein is driven by a term of the form 
e™”t, where v is the cell growth rate. In particular, Figure 3.5: Unit of production. The di- 
if we interrupt the production of proteins at time lution affects only the concentration of pro- 


lo, then the protein concentration P.(t) decreases teins. This can be visualized as a continu- 
as follows ous deterministic leaking of proteins (yellow 


spheres) out of the volume Vo. The volume 
P(t) = P,(to)e-”Ü to) t> lp. (3.1.24) Vo contains the machinery of production of 
a specific protein. 


Using the previous visualization, v can be inter- 
preted as the speed of the continuous leakage of 
proteins out of the volume Vo, while P, measures 
the “amount” of proteins inside such volume. 


Remark. In the classic stochastic models of gene expression, protein dilution is described as a 
stochastic exponentially distributed protein lifetime. This description allows to use the classic 
tools in order to derive explicit analytic formulas of protein variance, however the use of this 
description has not been justified. In the following section we see how to couple the continuous 
deterministic description of protein dilution with a discrete stochastic production process as in 
the four-stage model. The description via MPPP allows then to derive exact formulas. 


3.1.3 Explicit formulas under realistic assumptions 


In Section we have analyzed realistic assumptions for the Four-Stage model introduced in 
Section LT We first describe the hybrid four-stage model of the gene expression, which includes 
the protein decay due to dilution. General results on protein fluctuations are derived under 
general assumptions and the so obtained model is compared to the four-stage model detailed 
previously. Using realistic biological assumptions analyzed in we then derive an explicit 
description of protein variance. 
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Hybrid four-stage model 


The modeling of protein dilution introduced in Section is now coupled with a stochastic 
description of the other processes involved in gene expression. 


From equations (3.1.23) and (3.1.24), we obtain the following linear continuous system de- 
scribing the protein concentration dynamics 


dP, 
dt (t) = —vP.(t), tE llr-ile) 
En o kalesi (3.1.25) 
Vo 
P.(1o) = P?, 


where lo is the initial time. The birth of new proteins is a discrete phenomenon and results in 
jumps of protein concentration at instants {I,},=0,....0, as expressed by equation in (3.1.23). 
The solution of the system (3.1.25) is given by 


1 CO 
P(t) = Poe” + i D Liege ™. (3.1.26) 
k=1 


Since Vo is constant, using equation[3.1.22]we can write the previous equation in terms of number 
of proteins. For this reason, for here on we will write all relations in terms of number of proteins 
P. 

We are interested in the equilibrium statistics of proteins, therefore we can assume P° = 0, i.e. 
we forget the initial condition. The system describe just the protein dynamics in the case 
of volume dilution given the jump instants {lk }x=o.…... It is possible to couple this deterministic 
description with the stochastic description of the other involved processes by writing the protein 
birth instants in terms of the dynamics of genes, messengers and ribosomes. More in detail, if 
we denote with r the instant of ribosome binding on a mRNA and with 0% the time required 
to translate such messenger, then 

lk = Trk + O3,k) (3.1.27) 


since a new protein is released into the cytoplasm once a ribosome has completed chain assembly. 

Since the volume dilution affects only protein concentration, we can reuse the description of 
the four-stage model for gene activation/inactivation, mRNAs and active ribosomes dynamics. In 
particular, the point process describing active ribosomes can be written as R = (rk +03,k, k € Z) 
and, using the functional description (3.1.11), we can rewrite equation as follows 


P=Ð age = ff Lery CHR, dy) = RP) (3.1.28) 
k=1 RxR+ 
where 
fea) = Liryc” C). (3.1.29) 


The resulting model is depicted in Figure [3.6] and is referred to as hybrid four-stage model. 
The statistics of proteins in the hybrid four-stage model can be obtained by using the results 
of Section as detailed in the following theorem. 


Theorem 3.1.6. If the distribution of the lifetime of an mRNA is F2(dx) and the distribution 
of elongation time is F3(dy), then the expected value of the variable P, which is the concentration 
of proteins at equilibrium, is given by 


UP) = E(o2) (3.1.30) 
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Figure 3.6: Four-stage model with protein dilution — Hybrid model. This model is a 
slight modification of the four-stage model, see Figure Here protein dilution is the main 
protein decay mechanism (step7) and is described as a continuous deterministic process. By 
coupling the continuous model of protein dilution with the point process view of Section [3-1-1] for 
all other processes we obtain the hybrid four-stage model. The volume dilution proceeds at rate 
v and is strictly connected to the rate of cell volume growth. The parameters Af, AJ, Az and 
A3 indicate that the duration of the relative step has exponential distribution. The distributions 
F> and F are general. 


and its variance var(P) can be expressed as 


var Hybrid(P) = (3.1.31) 


1 


3 (P) +6. | L{s<o} p Dy _ (u+s+t)<y<-(u+s)}€ " duFs(dy)| dsFa(dt) 
RxR} R2 


+ A2204 (1 — 54) J e Netnte—2—u- 21 (2) Fo(dje "Te VE dx dz dé dC F3 (dy) F3(dn), 
R6 


+ 


where, for i = 2, 3, Fi(x) = (1 — F;(x)) and v is the rate of volume dilution. 


Proof. By applying Theorem with f(u,v) given by (3.1.29), we obtain the results. 


Notation 3.1.7. From here on we use the subscript “Hybrid 4-Stage” to distinguish the formulas 
obtained from the hybrid four-stage model from the formulas of the four-stage model, tagged now 
with “4-Stage”. 


Four-stage and hybrid four-stage models: comparison of general formulas. Assume 
that dilution is the main protein decay mechanism; we want now to compare the fluctuations 
resulting of the description of gene expression in the case of dilution obtained by the four-stage 
model and the hybrid four-stag model. In particular, we will use T heorems [3.1.5] and [3.1.6] 

The four-stage model describe the protein decay as a discrete stochastic process, which occurs 
up to a time oy with distribution Fy(dz). Classically it is assumed that this step is exponentially 
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distributed, i.e. o4 ~ F4(dz) = use~”* dz. In this case, formula (3.1.19) read 


Var4-Stage(P) = 
2 


= E(P) + 2164 | L{s<o} f ly epee adie duF3(dy) dsF;(dt) 
RxRy R? 


A = à) | e-NEtM+E av RE, (2)F3(C)e™” eE da dz dé ACF (dy) F3(dn). 
Re 

A comparison of the previous formula with formula shows that the only difference 
reflects in the first addend, no matter the choices of distributions F2(dt) and F3(dt). 

This difference is strictly connected to the modeling of volume dilution and not to the more 
realistic description of protein elongation. Indeed, if we consider the three-stage model presented 
in Chapter [2] with the description of dilution given, we obtain a formula of variance as in formula 
where the first addend is replaced by 4 z [P]. In Section [3.2.4] we will analyze more in 
detail the impact of the more realistic description of dilution on protein fluctuations. 

The previous comparison of formulas and has an important consequence, 
which will be analyzed hereafter. The development of stochastic models of gene expression has 
relied in part on the comparison of the distribution of proteins at equilibrium with a simple 
Poisson model. More in detail, we can obtain a first simplistic stochastic description of gene 
expression using a birth and death process, as shown in Figure [3.7] a new protein is synthesized 
at rate À and degraded at rate u, each event occurring in exponential random time. This 


Figure 3.7: Poisson model. Gene expression is described as a simple birth and death process, in 
which the birth of a new protein occurs at rate À, while it is degraded at rate y. 


simple description leads to well-known formulas for protein mean and variance at equilibrium, 
i.e. E(P) = A/p and var(P) = E(P), since the resulting process describing proteins dynamics at 
equilibrium has Poisson distribution. 

The stochastic models proposed in the last thirty years have marked a breaking point with 
respect to this naive modeling by showing that the distribution of proteins resulting by a more 
realistic description of the underlying processes is no more Poissonian. In particular, if the 
mRNA dynamics are also considered, the variance of protein number is bigger than the Poisson 
case and, therefore, their distribution is not-Poissonian. This argument is already present in the 
early work of Berg [5], in which the author showed that the variance of proteins at equilibrium 
could be wider than the Poisson case, which is recovered in the limit when each messenger 
produces exactly one protein. 

The comparison with Poisson process has led to the introduction of a measure of the deviation 
from Poissonian behavior: the Fano factor. The Fano factor, defined as v = var(P)/E(P), is equal 
to 1 in the case of Poisson dynamics. Many other works have investigated the non-Poisson nature 
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of protein dynamics both from theoretical and experimental point of view, see [54] [70] [47] 2] [48] 
to cite few. 

Using general approach described in Section for the four-stage model, the variance of 
proteins given by formula (3.1.19) has the form 


vara gtage(P) = E(P) [1 + K] 


where K > 0, which allows to conclude that, even for general distributions, the protein fluctua- 
tions are wider than a corresponding Poissonian description. 

On the other side, if we consider the hybrid four-stage model, the formula of variance (3.1.31) 
has the form 


varHybrid(P) = E(P) $ + K| j 
where K > 0. The previous formula tells us that the fluctuations deriving from the realistic de- 
scription of protein dilution given in Section[3.1.2]may in principle be smaller than the Poissonian 
description of figure [3.7] depending on the value of K. 

This is not in contrast with the results obtained experimentally showing fluctuations larger 
than the Poisson case. In fact, the last inequality just implies that the lower bound for fluctua- 
tions in protein number is lower than Poisson. In particular, if we consider a constitutive protein, 
ie. 6; = 1, and we assume protein elongation to be deterministic and mRNA degradation to 
occur in exponentially distributed time, then 


1 À 
VarHybrid(P) = E(P) $ + >] . 
Therefore, if we consider for example a mRNA average lifetime of 2 minutes, a doubling average 
time of 20 minutes and we consider a weakly affinity of the ribosome, i.e. A3 < 0.3, then we 
obtain that the second term within brackets in the previous formula is smaller than 0.5 and 
the resulting noise is smaller than the Poisson case, which breaks down a common result on the 
subject. The parameters have been chosen in biologically relevant ranges, however no experiments 
has been performed to investigate this specific aspect up to present. It would be worth to set up 
experiments in order to explore more in detail fluctuations in this particular direction. 

In conclusion, even if for the majority of proteins we possibly have a variance larger than a 
Poisson, this could not be the case for a small bunch of them and would be worth to investigate 
experimentally. Moreover, note that if on one side we may experience a variance smaller than 
the Poisson, on the other side this does not change the qualitative behavior of the whole system, 
since the variance still results proportional to the mean value and scales with it, as has been also 
shown experimentally [2] [48]. 

Proteolysis & Dilution: user manual. The volume dilution is in general the main 
mechanism of protein decay, however in specific cases, such as in the case of unstable proteins 
see Appendix proteolysis can play a central role and this should be considered in the 
modeling choices. In this chapter we have presented two models of gene expression: the four- 
stage model and the hybrid four-stage model. We are now going to describe the cases in which 
we should use one or the other model, depending on the specific conditions under analysis. 

If volume dilution is the main mechanism of protein decay and the effect of proteolysis can be 
neglected, then the hybrid four-stage model is the more adapted description of gene expression, 
since it includes a more realistic description of protein dilution. From here on we use the subscript 
“DIL.” to refer to the results obtained from the hybrid four-stage model under the assumption 
that dilution is the main protein decay mechanism. 

On the other side, if proteolysis is the main protein decay mechanism, then the description 
of the four-stage model is more appropriate, since the decay phenomenon occurs at a specific 
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(random) point in time. As discussed in Section [3.1.2] since the limiting step of proteolysis is 
the encounter of the protease with the protein, the lifetime of a protein is well described to be 
exponentially distributed, i.e. o4 ~ Fy(dz) = 4e "4% dz. For these reasons, from here on we tag 
with the subscript “LYSIS” the results obtained using the four-stage model with exponentially 
distributed protein lifetime under the assumption that proteolysis is the main decay mechanism. 


Notation 3.1.8. From here on, the results obtained from the hybrid four-stage model in case 
dilution is the main protein decay mechanism are tagged with the subscript “DIL.”, for example 
the variance of proteins in this case is denoted with var pr, (P). 

Similarly, the four-stage model will be used only if proteolysis is the main protein decay 
mechanism and the the variance of proteins will be denoted with varzysrs(P) in this case. 

From here on we will just use the parameter u4 to denote the rate of the protein decay, 
this parameter being intended to be the rate of the exponential protein lifetime in the case of 
proteolysis or the rate of volume dilution otherwise. 


Protein elongation and explicit formulas 


After the detailed discussion of the previous sections, we derive now a formula for the variance 
of proteins in the general case in which the dilution is the main protein decay mechanism under 
realistic assumptions for the duration of elongation discussed in Section If each codon 
requires the same amount of time in order to be processed, then the polypeptide elongation time 
TN) exhibits an Erlang distribution Erl(N, p), see Figure 
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Figure 3.8: Scheme of protein elongation. The time required by each amino acid in order to 
be attached to the growing polypeptide chain is a random variable E, exponentially distributed 
with parameter p > 0. N is the total number of amino acids composing the specific protein. 


If we consider a constitutive protein, i.e. the gene is always active (6, = 1), then, by integrat- 
ing the assumptions previously discussed and applying formula (3.1.31), the protein variance is 
given by 


1, 2A3u2 p^ 


2 3-44 (N —1)! 
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where Q(n,s), n € N, is the cumulative distribution function of a Poisson random variable of 
parameter s, i.e. if X ~ Poi(s) then P[X < n] = Q{n,s). 


varpir. (P) = E [P] | 


Note 3.1.9. The function Q(h,s) is also called regularized Gamma function and is defined as 
Q(h,s) = T(h,s)/T(s), where T(h,s) is the upper incomplete gamma function and T (s) is the 
gamma function. 
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If we consider the more realistic description rT } = 5 2 Tk, where Tk ~ Erl(Nx, pk), 
pr # pj for any k Æ j and Sy Npk = N, the distribution characterizing the elongation step 
is no more Erlang, but it can still be characterized. In fact, in this case the distribution of the 
duration of elongation is the convolution of Erlang random variables and the density function is 
characterized by the formula 


S (DM 


K 
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tit. (3.1.33) 


K 
D OMR 
— p.)N+ ? 
mit- +mgK=Nk—jl=1 mi (pı Pi) bei 


m;=0 li 


for t > 0, while fr(t; N,p1,...,px) = 0 for t < 0, see [32] for further details. 

In first approach, we can split codons in two different classes: normal and rare codons. 
The main difference between normal and rare codons is the time required to assemble them, 
the latter being present at low concentrations which cause the ribosome to wait long time, 
blocking eventually the elongation of proteins. The presence of rare codons has been proven in 
experiments, but their role is still under study. Thus, using the formalism introduced, if we have 
N; normal codons, each needing an exponential random time of parameter pı to be processed, 
then the time T; to assemble the normal codons has an Erlang distribution Erl(N1, p1). Similarly, 
the time Tz needed to process the rare codons is an an Erlang random variable Erl(N2, p2), with 
p2 << pı. The probability density function of the protein elongation is now given by 


fr(t) = 
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Unfortunately, the previous formula, and therefore formula (3.1.33), cannot be easily used to 
obtain a simple analytic formula of variance such as equation (3.1.32). 


3.2 Qualitative and quantitative analysis of realistic model 


In Section [3.1.3] we have described the model that results when we consider realistic description 
and assumptions of the underlying processes. We now investigate the resulting formulas and 
quantify the impact of different choices on protein fluctuations using biologically relevant ranges 
for the parameters. The aim of the following analysis is to obtain insights on the underlying 
mechanisms and to obtain some intuition on the protein fluctuations through the use of the 
models presented. This is even more important in the present context of gene expression, because 
of the lack of reliable experimental data at present days. In fact, despite the sophisticated 
techniques developed in recent years and the expertise of biologists, the experimental outcomes 
are often not sufficiently reliable to estimate precisely the protein variance. 
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3.2.1 Biological data and model parameters 


Before analyzing more in detail the results of the modeling introduced in the previous sections, 
we devote this section to relate the available biological data to the model parameters. These 
relations we will be used in the following sections in order to analyze the models outcomes with 
respect to biological reality. 

The protein dilution is caused by cell volume growth and is therefore strictly connected to 
the cell doubling time 7, i.e. the time required to the cell to double its volume. In particular, 
if we denote with v the rate of the exponential growth, then the protein concentration P(t) 
satisfies the relation P.(t) = P®e~”', where t € [0,7) and P? is the protein concentration at the 
beginning of cell cycle. Therefore, 


(3.2.1) 


where v corresponds also to the protein dilution rate. 
Biologists commonly use the concept of half-life (t1/2), i.e. the amount of time required for 
a quantity to fall to half its value measured at the beginning of the initial time of observation. 
In particular, the messengers and protein lifetimes are usually expressed in terms of “half-lives”. 
Consider now the messenger decay, the case of protein decay via proteolysis being analogous. 
The mRNA degradation has been described as a first-order reaction, i.e. a biochemical reaction 
which depends on the concentration of only one reactant, the number of mRNAs in the specific 
case. The (average) rate law of the mRNA decay reaction is therefore 4 M(t) = —y2M(t), where 
122 is the decay rate. If we denote with Tp, the mRNA half-life, then we can obtain the parameter 
2, Characterizing the exponentially distributed messenger degradation, thanks to formula 
In(2) 


= 7, 3.2.2 
aa Thl ( ) 


The proteolysis rate u4 can be obtained using a similar formula, where instead of Ta, we consider 
the protein half-life. 

The parameters Af and AĮ can be related to the dynamics of the switching of gene state 
(active/inactive). In particular, the average time of a change of gene state is given by AT! = 
1/(Af +27). The parameter 6,, defined as 6, = At /(Af +A7), can be then tuned by observing 
the residence time of gene in active state with respect to the residence time in inactive state. 

We fix now the average number of proteins, E [P], and the average number of mRNAs, E |M]. 
In particular, we may derive the parameters Az and Ag as functions of the degradation parameters 
and of the fixed averages. The transcription initiation parameter À2 is obtained by formula 


E [M 
jee (3.2.3) 
ô+ 
The translation initiation can be now be obtained as 
mvE[P] EP] 
À3 = = . 3.2.4 
8x6, EM] QE 


We conclude this section by giving reference parameters that will be used in the following 
sections in order to analyze the derived close analytic formulas. These parameters are resumed 
in table|3.3| where we have considered the parameters corresponding to different growth regimes. 
Section shows a more detailed list of biological parameters and should be kept in mind while 
reading the following sections. 

The speed of polypeptide chain elongation changes with respect to the growth regime, see 
table [3.2] We whose then the value of 18aa/s as average protein elongation speed. The average 


70 CHAPTER 3. REALISTIC MODEL OF GENE EXPRESSION 


protein length is of about 300 amino acids, see [7], and of about 360 amino acids for E. Coli, 
using table [46]. We choose then 300 as reference value for simulations. Therefore, if we took the 
reference value of 18 aa/s as average protein elongation speed, the corresponding time of protein 
elongation is of about 16 seconds. The average messenger half life considered in simulation is 
Thi = 120 seconds. 


3.2.2 Estimation of fluctuations: deterministic elongation 


The analytic formula can be studied with the classical tools of analysis and allows to 
precisely account for the fluctuations of the number of proteins. Nevertheless, there is no hope 
to further simplify it, since the integrals which appears in have no explicit analytic 
form. For this reason, we analyze the two following limiting cases, in order to obtain an analytic 
estimation of the possible fluctuations on the protein copy number: deterministic and exponential 
protein elongation. 

Suppose that the protein elongation is deterministic, i.e. the elongation time of every protein 


is 73 = 1 /u3 P—almost surely, with u3 > 0. Leaving unchanged every other assumption as 
discussed in previous section, we obtain the formula 


(D) x 1 A3 A23 (1 — 64)(A + u2 +v) 
P) = E(P 
g z 2 i u2+v (u2 +v) +u) (A +v) 


(3.2.5) 


where the subscript “D” stands for deterministic protein elongation. Note that the previous 
formula is independent on parameter u3 describing the elongation time. Intuitively, since the 
assumption of a deterministic protein elongation results in a delay of the protein production, 
the fluctuations are not affected by this delay but rather “transferred” from the ribosomes to the 
proteins. This independence is strictly connected with the assumption of the elongation step: 
by considering any distribution law different from deterministic, this will impact the protein 
fluctuations. 

Suppose now that the protein elongation is exponentially distributed. This assumption, 


despite not realistic, will serve to give an estimation of fluctuations and allow for a comparison 
dist. 


with the formulas available in literature. Therefore, assuming o3 = E,,,, the variance of the 
number of proteins in the case of protein dilution is 
var® (P) = E(P) ; : Asis (u2 + U3 +v) 
p2 + u3)(u2 +v)(us +v) 
_AgA3(1 — ô+ Juar? P ( A+u2+u3 A+u2+v )| (3.2.6) 
(A+ p2)(v? — 03) ~ ua (u2+s)(Atus) var) (A) > 


where we used formula and the subscript “E” refers to the exponential assumption of 
protein elongation is used to distinguish with formula (3.2.5). 

It is possible to prove that varg(P) < varp(P) for any choice of parameters. More in general, 
numerical results have shown that varg(P) < vargriang(P) < varp(P) for any set of chosen 
parameter, as shown in Figure [3.9] In particular, formulas and seem to represent 
a lower and upper bound for different choices of parameters of Erlang distribution. Therefore, 
formulas and can be used to estimate the impact of a realistic description of protein 
elongation with respect to a classic approach. 

The assumption of a deterministic duration of protein elongation is actually a good approx- 
imation of the description via Erlang distribution. In fact, assume that the duration of each 
elementary process of the assembly of a new amino acid is an independent and identically dis- 
tributed random variable T;, with i = 1,...,N and N is the number of amino acids in the specific 
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Figure 3.9: Relative variance with respect to protein elongation time. The protein 
elongation time is denoted with T à ). The red continuous curve is the relative variance, i.e. 
var(P)/E?(P), in the case of exponential protein elongation as given by formula (3.2.6). The 
green continuous curve is the relative variance in the case of deterministic protein elongation as 
given by formula (3.2.5). Observe that fluctuations associated with deterministic elongation are 
independent of protein elongation time. Colored circles are obtained via simulations under the 
assumption of Erlang duration of elongation for different values of the shape parameter N, i.e. 
the number of steps of the Erlang distribution (the number of amino acids in our description). In 
particular, the simulator is used to obtain the empirical mean and variance for the specific case 
of Erlang distributed elongation. The case N = 1 corresponds to the case of a single exponential 
step and we therefore recover the exponential profile (red curve). This case would corresponds 
to the case of a protein made of a single amino acid. The case N = 400 corresponds to average 
case in prokaryotes, since the average number of amino-acids in E. Coli is in between 360 and 
400. In order to compare different curves, we have changed the time to process each amino acid 
in such a way that the protein elongation time is consistent for each vertical cut of the curves. 
Note that as long as N increases, the variance of the duration of elongation step diminishes, 
but the overall protein variance increases. In particular, the profile corresponding to the more 
realistic case N = 400 is close to the profile corresponding to the deterministic elongation. 


protein. The average length of proteins in E. Coli, expressed in terms of number of amino acids, 
is N = 360 , see table [46]. Consequently, since N is quite large, the central limit theorem gives 
the following approximation 


TN) ~mN+GVN 


where m is the average elongation time for one amino-acid and G is a centered Gaussian random 
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variable. If the variance of the random variable G is not too large, then T? is close to the 
deterministic constant mN. Formulas and show that the variance is a continuous 
functional of the probability distributions, since it is expressed in integral form depending on the 
specific probability measures. In particular, as long as the distribution of the protein elongation 
converges continuously towards a normal distribution with narrower standard deviation, the 
variance obtained by the formulas in the deterministic case will be close to the exact value. 


3.2.3 Four-Stage Model: a counter-intuitive result 


The general description of the gene expression via MPPP approach as detailed in Sections [3.1.1] 
and [3.1.2] allows to consider any distribution for few steps of the model. 

The classic approach requires each step to have exponentially distributed duration T ~ Ey, 
whose variance is given by 


1 
12° 
and thus its fluctuations are large compared to its the average value. 

As in Section we consider the “opposite” case of deterministic duration of some step 
described in order to see the impact of distribution choice on the resulting fluctuations in protein 
number. 

If we assume each step to have exponen- 
tially distributed duration except for protein 
elongation and take deterministic/exponential 180r 
distributions for this step, we obtain the 
counter-intuitive result vargxp < VarperT. 
Figure illustrates this results, here the 140} 
square root of relative variance is plotted with 


var(T) = E?[T] = 


—DET--> Doubling time =40 (min.) 
160} —EXP--> Doubling time =40 (min.) 


respect to the parameter describing the RBS Pel 
affinity between ribosomes and messengers. S100! 
This result holds also when replacing other = 
exponential steps with deterministic ones. We = 80; 
do not investigate here the appropriateness ~~ sal 


of these choices, but we want rather to ex- 
tract general information about the response 40; 
of the system with respect to deeply different 
choices. Moreover, simulations indicate that 
this result holds also for other distributions 0 
with intermediate fluctuations. In particular, 
the obtained results seem to indicate a distri- 
bution ordering with respect to the fluctua- 
tions of duration distribution. If we consider Figure 3.10: Impact of elongation distribution 
Erlang (or Gamma) distributions and Gaus- ON protein relative variance. 

sian distributions, cfr Chapter [2] then by tun- 

ing distribution parameters we can change the resulting statistics of the duration: decrease 
variance of duration corresponds to increase in the resulting fluctuations in protein number. 


3.2.4 Proteolysis vs. dilution 


Consider now a four-stage model with deterministic protein elongation and a2 ~ E, as discussed 
in Section If we assume that the protein decay occurs by proteolysis, then using formula 


3.2. QUALITATIVE AND QUANTITATIVE ANALYSIS 73 


(3.1.19) and the previous assumptions we obtain 


À A2A3(1 — 64)(A + u2 + 
varzysis (P) = E(P) |1+ Te a( ae Gea AL (3.2.7) 


Uo + Ua (H2 + pa)(A + u2)(A + pa) 


where E [P] = 64 
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Figure 3.11: In [(a)] the Fano factor, ie. var(P)/E[P], is plotted with respect to the ratio 
E [P] /E [M] and the cell doubling time varying from 20 minutes up to 120 minutes. The mRNA 
number, which affects the ratio E [P] /E [M], varies now from 1 up to the average number of 
proteins E [P] = 40. The protein elongation speed is 18 amino acids per second and the protein 
length is 300 amino acids. The mRNA lifetime is fixed at 2 minutes. The red-coloured graph rep- 
resents the Fano factor for the four-stage model with proteolysis, while the blue one corresponds 
to the dilution case as described in Section [3.1.2] In[(b)| the Fano factor is plotted as function of 
E [P] /E [M], with cell doubling time fixed to 20 minutes. Here, with respect to plot|(a)} we have 
added the curve relative to the case of discrete deterministic protein death, green dashed line, 
corresponding to the four-stage model with deterministic death. The plots are obtained using 


the analytic formulas (3.2.7) and (3.2.8). 


If on the other side we consider volume dilution as the major mechanism of protein decay, 


then, using Theorem and the formula (3.1.31), we obtain 


1 Te A3 A2A3(1 = 64)(A + uo + ua) 
2 pot pa (M2 + pa)(A+ p2)(AF pa) J’ 


VarDIL (P) = LP) (3.2.8) 


where u4 represents now the cell growth rate, i.e. 4 = v. The comparison of the two scenarios is 
represented in Figure [3.11] where we plot the relative variance for both dilution and proteolysis 
case, as function of cell doubling time 7 and of the average number of proteins produced per 
mRNA. As shown in Figure [3.11al the variance associated with proteolysis (red surface) overes- 
timates the fluctuations with respect to the surface relative to dilution (blue surface). Therefore, 
the description of protein dilution in classic models to have an exponentially distributed duration 
overestimates the variance. 
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Remark. In Figure we have plotted the Fano factor, defined as var(P)/E [P], since both 
formulas (3.2.7) and (3.2.8) are linear with respect to the average number of proteins. For this 
reason, in the limit case of small values of the average number of proteins per messenger we 


recover the first constant terms of equations (3.2.7) and (3.2.8), 1 and 0.5 respectively. 


The difference between the two mechanisms of decay is just in the first term and, in both 
formulas protein noise var(P)/E?(P) decreases as the level of expression of the protein increases. 
However, if such scaling behavior is captured by both descriptions, there is a difference, because 
of the scaling constant is now replaced with 1/2. This explains the reason of the qualitative agree- 
ment of experimental results, such as [2], with the noise scaling, but shows that if qualitatively 
classic models capture this phenomenon, the formulas might not be quantitatively accurate. 

The difference of these two approaches is not noticeable as long as the first constant term of 
equations and is smaller than the other two terms, which seems to be the case in 
many situations. However, when we consider an active promoter, i.e. 64 = 1, the last term of the 
previous formulas disappear and the difference between the two descriptions become remarkable 
as long as we consider weak affinity between messengers and ribosomes, i.e. Ag << 1, in the case 
of an unstable protein, i.e. 1/u4 smaller than cell cycle. In this case, still in a parameter window 
of biological interest, the impact of a more realistic description is substantial, especially for 
highly-abundant proteins. It would be worth verifying experimentally such conclusions obtained 
just by the theoretical model even if in parameter ranges of biological interest. 


Conclusions on protein decay step 


We have analyzed the impact of different descriptions of protein decay on protein fluctuations. 
In particular, we have shown that in the case in which dilution is the main decay mechanism, 
then the impact of a more appropriate model description, the hybrid four-stage model, reflects 
in a change in the first addend of the variance of the number of proteins. Comparing the 
hybrid four-stage model with the four-stage model we have seen that there could be significant 
differences between the two descriptions in terms of protein variance under biologically relevant 
sets of parameters in the case of non-regulated genes, i.e. 64 = 1, and it would be worth to verify 
experimentally the results obtained. 

However, in the case of regulated proteins, i.e. 6+ Æ 1, the third term in the formulas 
and dominates and the results obtained by the two models are very close. 


3.2.5 Impact of different steps on protein fluctuations. 


We analyze now the four-stage model with realistic assumptions discussed in Section [3.1.2] i.e. 
all processes are supposed to have an exponentially distributed duration except the protein 
elongation which is supposed to be deterministic and the protein decay, which is supposed to be 
driven by dilution. Recall that, in this case, the formula of variance is (see equation (3.2.5) 


1 EA A3 A2A3(1 — ô) (A + H2 + ua) 

2 č pa+pa (p24 pa)(A + p2)(A + pa) 

prs 1. EÏP] pm : pota (1—84) (A+ po + pa) 

=E |P] |= + = HE [P i 
2 E[M] pe + a Mot pa 64 (A+ pe2)(A+t pa) 


var) (P) =E [P] | 


In order to analyze the impact of the choice of deterministic protein elongation, we compare 
this model with a similar model where the elongation duration is supposed to be exponentially 
distributed and the variance is given by formula (3.2.6). 
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Figure 3.12: Impact of elongation distribution on fluctuations of stable protein with 
no promoter regulation. In figure [(a)] the relative variance for the four-stage model with 
deterministic (red curve) and exponential (blue curve) protein elongation is plotted with respect 
to the protein elongation time. The average protein number is E[P] = 400, average mRNAs 
number E [M] = 4 with average lifetime of 2 minutes, while the cell doubling time is 40 minutes. 
The green rectangle corresponds to the elongation time window of a reference protein of 325 
amino acids, the elongation time depending on the speed of elongation varying between 10 and 
20 amino acids per second (a.a./s). In figure [(b)] we plot the difference between the relative 
variances with the two choices of distribution. 


General remarks 


The deterministic protein elongation has been chosen as a limiting case and seems to act as 
an upperbound for the variance of proteins for any other chosen distribution of protein elonga- 
tion. Formula shows clearly that the variance scales with respect to the protein level or, 
equivalently, the relative variance decreases as 1/E [P]. 


The second term of the previous formula shows also the dependence of the noise on the am- 
plification factor of messengers. In particular, this term is proportional to the ratio E [P] /E [M], 
i.e. the number of proteins produced per messenger, and we expect to increase the noise by 
increasing this ratio, as confirmed by the Figure[3.13]and|[3.18b] Therefore, a strategy to reduce 
noise at fixed protein expression level would be to increase the number of messengers, reducing 
the impact of this addend. This is in agreement with the work of Ozbudak et al.[51], where the 
authors studied the sensibility of noise with respect to changes in the transcription or transla- 
tion rates by modifying genetically strains of Bacillus Subtilis. An increased transcription rate 
resulted in a sensible decrease in the noise of the corresponding protein. 


The third term in the previous equation corresponds to the noise deriving from the dynamics 
of gene regulation. This term is proportional to the ratio (1 — 6,)/6, = AJ /AŤ, which is the 
ratio of the probability of inactive gene over the probability of active one. This term seems to 
dominates the others in several cases and the introduction of a gene regulation is associated with 
noisier profiles of the protein dynamics. Since the introduction of the first model considering 
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gene activation, see Peccoud and Ycart [57], a debate has occurred in the community because the 
gene activation gives a good explanation of bursty profiles of messenger and protein numbers, 
see [61]. The observation of high intensity production followed by long quite periods, also 
called bursts, is consistent with the results obtained by the introduction of the gene regulation. 


Stable proteins 


We first analyze the case of stable proteins, i.e. 6, = 1. The main mechanism of protein decay 
is now represented by volume dilution, whose typical time scale is the cell cycle time, which 
varies between ~ 20 and ~ 60 minutes. In the case of non-regulated genes, the variance of the 
four-stage model with realistic assumptions is greater than formula (326), in fact, if 
64 = 1, 


var (P) =E IP) |3 + En | 
` 1  E[P] pm Haha = yar) 
zE [P] E E[M] u2 + pla (1 (u2 + u3)(u3 + aa) | P 


In order to analyze the impact of this step under realistic assumptions with respect to classic 
approach, we consider realistic parameter ranges. 

In Figure [8.12a] we plot the relative variance of protein number with deterministic (red) and 
exponential (blue) elongation step. For short elongation times, the two descriptions give similar 
fluctuations, while the deterministic description differs more and more as long as the elongation 
time increases. Despite the difference might seem important, the plot of the difference in relative 
variances shows that the differences is of the order of 1074 and seems not to be as crucial as 
other steps, as we will see later in this Section. 

We focus on proteins corresponding to regulated genes, i.e. 61 #1. Because of the previous 
result, we have to study the third addend in formula which can be rewritten as 


AzA3(1 — 64)(A + Ha + Ha) 
(u2 + pa) (A + u2)(A + pa) 


We restrict our analysis to biologically relevant ranges of the model parameters. The parameter 
A = Aà] + AĮ gives the order of magnitude of the time to wait in order to observe changes in 
the gene regulation activity, while the term 6, represents the percentage of activation of the 
gene. The gene regulation is obtained by means of transcript factors (activators or repressors); 
experiments show that in average each transcript factor visit the corresponding gene once each 
~ 45 minutes. Moreover each gene exhibits in average ~ 100 transcript factors, which makes the 
visit time of transcript factors on the DNA of the order of at most few minutes. For example, as 
showed by Elf et al.[T5], the Jac repressor spends almost 90% of time non-bounded and diffusing, 
with a residence time of < 5 milliseconds. The time for a single lac repressor dimer in one cell 
to find a specific operator has been estimated to be ~ 354 seconds. 

We suppose therefore for stable proteins and under physiologically meaningful conditions that 
A >> u4 and u2 > p4. In this case, reduces to the following approximated formula 


A2A3 (1 — 64)(A + H2 + Ha) og [P] (1—84) Hoa 
(Ho + pa)(A + po)(A+ pa) O A mtp 
H2 T H4 H2 H4 1 H2 + La 


which gives the following approximated formula for the protein variance 


À alta À + u2 + pa 
AT (Ha + pa) (A + pa)(A + po) 


(3.2.9) 


= E [P] 


1 E [P] La À 1 H2la 
(D) LE 7 1 
var P) ZE[|P + +E[P à 3.2.10 


3.2. QUALITATIVE AND QUANTITATIVE ANALYSIS 77 


DET--> Time activation/deactivation (s) =30 DET-> ò =0.1 


E DET--> Time activation/deactivation (s) =180 DET-->8} =0.3 


a 


Fano factor — var(P}/E(P) 


Fano factor — var(P}/E(P) 


Oe ee 7 20 Di 80 
BE 2 = = OB ae ce E 
30 ee s 30 gaT : 30 
protein / mRNA protein / mRNA 
cell doubling time cell doubling time 


(a) Impact of gene timescale. (b) Impact of regulation activity. 


Figure 3.13: Impact of gene regulation on protein variance. The Fano factor for the four- 
stage model with deterministic protein elongation is plotted with respect to the ratio E [P] /E [M] 
and with different cell doubling times, from 20 up to 40 minutes. Here the average protein number 
is fixed to E [P] = 40. The average number of mRNAs varies between 1 and E [P] /10 and their 
lifetime is fixed to 2 minutes. 

The two surfaces in figure correspond to two different typical times for gene regulation 
changes, i.e. the typical activity time Tact takes the value 30 and 180 seconds and we have the 
relation A = 1/(60 * Tact). In this plot, the activation rate is 64 = 0.1. 

The two surfaces in figure [(b)] correspond to two different rates of gene regulation activity, i.e. 
we suppose that the gene probability of activation 61 takes the values 0.1 and 0.3, where the 
lower surface corresponds to the higher gene activity. 


where the approximation concerns only the third term inside brackets. 

The third term of formula is proportional to the protein average E[P] and no more 
on the ratio E [P] /E [M]. As consequence, the impact of this term is increased as long as we 
consider highly expressed proteins, no matter the number of proteins produced per transcript. 
This term depends also on gene activation: by lowering the probability of the repressor unbinding 
ô, (resp. increasing the probability of repressor binding 1 — ô) or by increasing the time 1/A 
between gene activation /deactivation events we increase the protein variance. The results relative 
to changes in the gene activity are showed in figure [3.13] In particular, figure [3.13a] shows the 
impact of changes in the typical time of gene regulation: when regulation events become rare, 
the corresponding fluctuations increase considerably (here 6; = 0.1). On the other side, figure 
[3.13b] analyzes the impact of changes in the activity of the gene, i.e. the influence of changes of 
the probability to have activation/deactivation. If we consider low probability of gene activation, 
64 = 0.1, then we experience larger fluctuations than for a case where the gene is more active. 
Figure [3.13] shows also that the cell doubling time has an important role in the resulting protein 
fluctuations, shorter doubling time corresponds to higher fluctuations. 

We now compare the impact of gene activity and of different choices of protein chain elonga- 
tion, in order to identify the most important steps and test the impact of more realistic assump- 
tions of protein elongation on protein fluctuations. In Figure [3.14] we plot the Fano factor of 
protein number with respect to different values of gene activity, i.e. 6, = 0.1 and 6, = 0.3, for 
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the exponential and deterministic assumption of protein elongation. We consider here a “stan- 
dard” prokaryotic protein, i.e. it is composed of 300 amino acids and the speed of translation is 
fixed at 18a.a./s. Red and green surfaces in figure[3.14al represent the protein Fano factor with 
deterministic and exponential chain elongation. It is clear that a change on gene activity has 
a major impact on protein fluctuations, while the red and green surfaces are close for identical 
parameters. 


In order to analyze the impact of cell doubling time, we now fix the gene regulation at 
64 = 0.1 and we consider the curves relative to 20 and 40 minutes of cell doubling time, see 
Figure[3.14b] As for gene regulation, different growth conditions have a major impact on protein 
fluctuations with respect to different choices for the protein elongation distribution. In particular, 
while changes between deterministic and exponential elongation account for less than 3%, the 
change in cell doubling time accounts here for almost 50%. Thus, also in the case of switching 
gene status, the impact of protein elongation on the overall protein fluctuations seems to have 
a smaller impact than other steps. In particular, the results suggest that protein fluctuations 
are extremely sensitive in changes in the gene regulation and cell doubling time, which seem to 
account for a large part of the observed fluctuations. 
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Figure 3.14: Impact of gene regulation vs protein elongation. The Fano factor for the four- 
stage model with deterministic and exponential protein elongation is plotted with respect to 
the ratio E [P] /E [M] and with different cell doubling times, from 20 up to 40 minutes, where 
E [P] = 40. The average number of mRNAs varies between 1 and E [|P] /10 and their average 
lifetime is fixed to 2 minutes. In particular, gene activity takes the values 6, = 0.1 and 6, = 0.3, 
while gene regulation typical time is of 360s [I5]. 

The red and green surfaces in figure correspond to deterministic and exponential protein 
elongation. 

In figure [(b)] we plot two cuts at Tdoubling = 20 and Taoubling = 40 of the surfaces corresponding 
to O4 = 0.1. 


We conclude this section by giving approximated formulas of protein variance, which allow to 
highlight the dependence of fluctuations on global characteristics such as protein and messenger 
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averages. The last term in equation (3.2.10) can be written as 


Hola 1 
= T? 
Matha gta 


H2 


so we can reduce the third term if we have a slow cell dilution or long-lasting mRNAs, i.e. for 
small values of 44 and u2. Now since the second term in requires large values of u2 in 
order to be minimized, then a good strategy is to consider unstable messengers coupled with 
slow cell growth. 

From formula we obtain the following formula for the Fano factor 


var(P) (P) ae je Emax [P] H4 LT [P] Le ô+ H2H4 (3.2.11) 
E [P] 2 Emax [M] u2 + pa D À pat pa’ 
where 
aes 
Emax [M] = —, (3.2.12) 
H2 
Emax [P] = Ms (3.2.13) 
H2ŅH4 


are the average number of mRNAs (resp. proteins) in the case of non-regulated gene, i.e. 64 = 1. 
Note that, once we have fixed the parameters except gene activation, Emax [M] and Emax [P] 
represent the maximum amount of messenger and proteins respectively, therefore the label “max”. 
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Figure 3.15: Approximated relative variance for stable proteins. Plot of the formula 
(3.2.14) for different values of gene typical activation/deactivation time (A = AŤ + AĮ ), figure 
and for changes in cell doubling time, figure We consider here a protein with mean 
Emax |P] = 400, associated with Emax |M] = 10, the average messenger lifetime is 2 minutes and 
the gene is mostly inactive, 64 = 0.1. 


Equivalently, we can compute the relative variance 


var(P) (P) n 1 $ 1 Ha 1-64 popa 
E2[P] 2 Umax [P] Umax [M] H2 + H4 A H2 + H4 | 


(3.2.14) 
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Remark. There is little information of the time of regulation of genes for large part of the genome 
and the activity of each gene seems to have specific characteristics. However, if A & u4, then 
the gene regulation would be a rare event during cell cycle, causing abruptly changes in protein 
quantities from one generation to the other, resulting in a quite faulty regulation of the protein 
production. For this reason, we do not expect expressed proteins to show too long gene regulation 
cycles. A particular case is represented by stress-connected proteins, which are activated only in 
emergency cases. In this case, it is reasonable that the repressor is bound to the gene for really 
long times and, once activated, the corresponding protein would show a peak in production. This 
would be consistent with the results of Newman et al.[48|, where stress proteins show high levels 
of noise. 


Rare codons 


We briefly discuss the case of rare codons and their impact on protein fluctuations in the present 
context. 

Any cell uses 61 different codons, i.e. triplets of nucleotides, to encode for 20 different amino 
acids and three termination codons. As a consequence, there is a redundancy in the possible 
choices of a genetic sequence and each codon is encoded by one up to six different sequences. 
These sequences are read by tRNAs which have been charged with the corresponding amino-acid. 
It results that the same protein can be therefore be produced by many alternative sequences of 
nucleic acids. The large variability on the frequency of the use of codons in different organisms 
is the result of evolutionary forces [24]. It turns out that there is a tight connection between 
condons and tRNA concentrations, since the system has evolved in order to reach a balance 
between these two fundamental components in order to optimize the efficiency of the translation 
apparatus. 

The codon biases between different organisms have become of crucial interest since the in- 
tensive use in biotechnology of heterologous hosts to produce a desired protein. This technique 
consists to force a host organism to express, possibly at high rates, a gene which is not part of its 
genome mainly via recombinant DNA technology. However, it happens often that the protein is 
not expressed or it is expressed at very low levels. A plausible explanation of this abrupt change 
in expression level of the specific protein in hosts is that some codons used in the codification are 
rare codons in the host organism. Improvements have been done since the first use of this tech- 
nique in 1977 [30], but there are still open questions on the subject and we refer to Gustafsson 
et al.[25] for further details. 

The presence of rare codons also in wild types has arisen the question on the utility of such 
codons and their possible effect on the resulting gene expression. Using our modeling approach we 
study the impact of the presence of rare codons on the resulting fluctuations of different processes. 
In particular, we consider a 400 amino-acids protein with “standard” biological parameters, which 
is elongated in an average time of 100 seconds if no rare codons are present (dashed curves in 
figures and [3.16b). We proceed by increasing the portion of rare codons, studying the 
impact with respect to the efficiency of the rare codons, i.e. we study the impact when the 
efficiency to process a rare codon is reduced to a fraction of the normal codon, low percentages 
meaning a long time to be processed. The duration of elongation step is described as the sum of 
two independent Erlang random variables, one accounting for the normal codons, the other for 
the rare codons. In accord with the results of the previous section, when the efficiency of codon 
processing drops down, fluctuations diminish. However, when the efficiency is above the 20% of 
the normal codons, even if the fraction of rare codons if larger than 10% of the total codons, the 
fluctuations align with those of a protein with no rare codons, see figure 

Rare codons are addressed in literature to have a negative impact on the expression of a 
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Figure 3.16: Rare codons impact. Relative variance of protein and ribosome concentration as 
function of the efficiency of the rare codons with respect to normal ones, i.e. 10% meaning that 
the rare codon rate of translation is the 10% of the translation rate of a normal codon. Given 
the speed of elongation and the average number of amino acids in a protein, we can estimate 
the time required to process each codon and obtain the rate of assembly of a “normal” codon. 
The rare codons are then modeled as to have less efficient assembly rates and the curves are plot 
with respect to the efficiency of the rare codons with respect to normal ones. Different curves 
represent the results for different concentrations of rare codons within a 400 amino-acids protein 
The protein is elongated in 100 seconds if no rare codons are present (dashed blue line). 


specific gene, by showing low expression levels and possibly being responsible of bad folding 
of proteins. The result obtained is in apparent contrast with these results. In fact, as long 
as ribosomes are stuck during translation because of rare codons, all other protein types take 
advantage and can exploit common resources and rare codons seem here to create a disadvantage 
in this competition. The previous result says that, if all other conditions are fixed, the increase 
in protein translation time associated to the presence of rare codons would result in a less noisier 
protein production. 


Unstable proteins 


In the previous section we have seen that the protein elongation step has little impact for stable 
proteins. We now analyze the impact of the protein elongation step in the case of unstable 
proteins, i.e. proteins that are degraded at high rates through proteolysis. In particular, we 
assume protein lifetime is of the order of messenger’s lifetime, i.e. 1 — 2 minutes. 

If ua © u2, then proteolysis is the main protein decay mechanism while dilution is negligible. 
In this case, we consider non-regulated protein production, then the formulas reduce now to 


14 = (: (=) ) ~ var {is (P). 


vartysis (P) © E [P| Í + a > E[P] 
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Figure 3.17: Impact of elongation distribution on fluctuations of unstable protein with 
no promoter regulation. Difference between relative variances with deterministic/exponential 
elongation for unstable proteins with non regulated promoter, i.e. 6, = 1. 


Note 3.2.1. Observe that here we have used the general formula (3.1.19), since the main mecha- 
nism of decay is proteolysis, while in the previous section we used the formula relative to protein 
dilution. 


The impact of the description of the elongation step is big as we consider long protein elon- 
gation time, i.e. small values of the parameter 13. Figure [3.17] shows that the impact of a more 
realistic description of protein elongation is significantly more important than for stable proteins, 
see Figure 

The difference becomes more relevant if we consider also the promoter regulation. Similarly 
to the previous section, we may derive simplified formulas for the variance of proteins, given the 
assumption u4 % u2. The relative variance can be determined as 


var(D)(P) ~ I Pi 1 1 4 1 — 04 ua(A + 22) 
PIP] Emax [P] 2Ema [M] + (A+ m)? 
where Emax |M] and Emax |P] are the average number of mRNAs (resp. proteins) in the case of 
non-regulated gene, i.e. 64 = 1. 

In Figure we plot results relative to this situation. The plot show the surfaces 
relative to deterministic (red) and exponential (blue) as function of the number of amino acids 
composing the protein and with respect to the average number of proteins produced per mRNA 
transcript. The elongation speed is fixed to 18a.a./s (amino acids per second). The discrepancy 
of the two surfaces is bigger when we consider longer proteins or when each transcript is translated 
dozens/hundreds of times. If we consider the three cuts shown in figure it results that 
changes in the amplification factor of mRNAs/proteins change the level of noise, but there is 
little change in the relative difference between the exponential and the deterministic approach. 

Despite the impact of elongation is more relevant in the case of unstable proteins, however, it 
seems to show a sensible difference just for very long proteins, while the exponential modelisation 


(3.2.15) 
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seems to be a good approximation for short proteins. 
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three different values of the ratio E [P] /E [M]. the ratio E [P] /E [M]. 


Figure 3.18: Impact of protein elongation - deterministic vs. exponential. Behavior 
of an unstable protein for the four-stage model under the assumptions of deterministic and 
exponentially distributed protein elongation. The protein lifetime is supposed to be identical to 
the mRNA lifetime, which is fixed to 2 minutes and the protein elongation speed is 18 amino 
acids per second. The average number of proteins is 400 and the gene deactivation rate Ay = 0.02 
is twenty times greater than the activation rate. In figure [3.18alis represented the Fano factor of 
protein number with respect to protein length, varying between 30 and 3000 amino acids. The 
three curves represent the Fano factor for three different values of the quantity number of proteins 
per mRNA (P/M = 40,100,400). Fano factor of the four-stage model under deterministic 
assumption, solid line in figure [(a)| is insensitive to changes in the protein length and is always 
larger than for the four-stage model with exponentially distributed protein elongation time, 
dashed line in figure In figure are shown the 3D graph, as function also of the 
ratio E [P] /E [M], the red surface is relative to deterministic elongation, while the blue one to 
exponential elongation. 


Conclusions on protein elongation step 


In this section we have analyzed the impact of the description of protein elongation step on 
resulting fluctuations on the protein number. In Section we have seen that the more 
realistic description of elongation step to have an Erlang distributed duration is close to assume 
the step to occur in a deterministic fashion, see also Figure [3.9] For this reason, we have focused 
on the deterministic and exponential choices for the elongation step. 

Under deterministic assumption, protein shows higher fluctuations, which confirm that the 
common exponential assumption is not adapted to describe this step. However, if we look at 
the quantitative results for stable proteins in a range of biologically relevant parameters, then 
the difference between the two descriptions is small if compared to the impact of gene regulation 
and protein decay rate by dilution or proteolysis. This is slightly different in the case of unstable 
proteins, where we assumed that the decay time of proteins and of messengers is comparable. 
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In this case the deterministic elongation shows most marked difference for proteins composed of 
many amino acids. 

Observe that the analysis conducted here assumes that the time to process each codon is 
identical. This brings to have a completely equivalent description in terms of time of elongation or 
protein length. If we consider also rare codons, the impact of elongation can be more significant. 
In fact, rare codons cause the ribosome to stop elongation and wait for the specific amino acid, 
slowing down the production of the specific protein. 

In conclusion, the description of the elongation step seems to have a minor impact on the 
second moment of protein number except for the case of long unstable proteins, when we consider 
biologically meaningful model parameters. 
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3.A Reference parameters 


In this section we resume the parameters of reference for prokaryotes, which has been used as 
reference parameters for our model. 

The specific data considered here are referred to E. Coli bacterium and are mainly obtained 
from the classical reference Modulation of chemical composition and other parameters of the 
cell by growth rate by Bremer and Dennis [26]. We will further specify the reference for the 
parameters obtained from different papers. 


Table 3.1: Global parameters. We list now few global parameters for E.Coli bacteria. 


Parameters Units Value Ref. 
Proteins per transcript unitless 10 [44 
Average percent of a given transcript species % 70 [60 
occupied by a ribosome 

Average mRNA half-life min. 5to10 [0] 
Average repressor binding time ATAT s ~ 360 
Number of proteins / mRNA [P] /E[M] unitless 100 to 10000 [70] 
Number of proteins / mRNA [|P] /E[M] unitless 10 [4] 
Unstable protein half life min. 5 59 


In table [3-1] we have reported few global parameters of E. Coli bacterium. Using the data of 
Bremer we have built table which gives different cell parameters with respect to different 
growth regimes. These parameters have been used to built reference parameters to study the 
results of our model in different biologically meaningful situations. 


Table 3.2: Biological parameters. Parameters pertaining to the macromolecular synthesis 


rates in exponentially growing E. Coli as function of growing rates at 37°C. Reference Bremer 
26}. 


Parameters Units Division time 


T, 24 min. 7,60 min. 7, 100 min. 


RNA polymerases per cell 10% RNAP/cell 11.4 2.8 1.5 
RNA polymerase activity % 30 20 17 
mRNA chain elongation Nucl/s 55 45 39 
Peptide chain elongation aa/s 21 16 12 
Ribosomes per cell 10° ribosomes/cell 72 13.5 6.8 
Ribosomes activity % 80 80 80 
Distance of ribo. on mRNA nucleotides Al 85 79 


We conclude this section by giving few insights about the calibration of the model parameters 
using the biological data of tables [3-1] and [3.2] 

Protein decay may occur because of cell dilution effect or because of proteolysis, as studied in 
Section [3.1.2] The protein dilution is caused by cell volume growth; for this reason it is strictly 
connected to the cell doubling time 7 and the rate v of protein dilution satisfies the relation 


(3.4.1) 
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Proteolysis is a rare event for which we have no precise characterisation. For this reason we do 
not specify here any relation for the rate of proteolysis 4. 

Bernstein et alii [6] showed that the 80% of all mRNAs has half-lives 7) between 3 and 8 
minutes, while Yu et alii showed in [75] that the tsr-venus mRNA has a degradation time of 
1.5 + 0.2 minutes. Biologists commonly use the concept of half-life (t:,2), ie. the amount of 
time required for a quantity to fall to half its value measured at the beginning of the initial 
time of observation. In particular, the messengers and protein lifetimes are usually expressed in 
terms of “half-lives”. Consider now the messenger decay, the case of protein decay via proteolysis 
being analogous. The mRNA degradation has been described as a first-order reaction, i.e. a 
biochemical reaction which depends on the concentration of only one reactant, the number of 
mRNAs in the specific case. The (average) rate law of the mRNA decay reaction is therefore 
4 M(t) = —u2M (t), where u2 is the decay rate. If we denote with Ta; the mRNA half-life, 
then we can obtain the parameter u2, characterizing the exponentially distributed messenger 
degradation, thanks to formula 


(3.4.2) 


The proteolysis rate u4 can be obtained using a similar formula, where instead of Tp, we consider 
the protein half-life. 


Table 3.3: Model parameters. Parameters of reference used in simulations. Here the param- 
eters for a protein with average number of E[P] = 100. We consider three different mRNAs 
average, E [M] = 1,10,100, which affect the parameter À2. The average ribosome translation 
speed considered is 18 amino acids per second and the average protein length taken is 300 amino 


acids. 


Parameters Units Division time 

T, 20 min. T, 60 min. T, 120 min. 
Gene activation At s7! 1 1 1 
Gene deactivation AT s7! 0 0 0 
Transcription init. À 1072571 0.58, 5.8, 58 0.58, 5.8, 58 0.58, 5.8, 58 
mRNA half-life In(2)/pe2 107s 1.73 1.73 1.73 
Ribosome binding A3 107%s71 58,5.8, 0.58 19.3, 1.93, 0.19 9.6, 0.96, 0.096 
Protein elongation 1/u3 s 16.66 16.66 16.66 
Dilution v 1074s7! 5.77 1.92 0.96 


We fix now the average number of proteins, E |P], and the average number of mRNAs, E [M]. 
In particular, we may derive the parameters À2 and À3 as functions of the degradation parameters 
and of the fixed averages. The transcription initiation parameter Àz is obtained by formula 
z [M 
do = pM (3.A.3) 
ô+ 


where we recall 64 = At /(Af + AZ). 
The translation initiation can be now be obtained as 
mvE[P] EP] 
A3 = = ; 3.A.4 
37 ed, E[M] ee 
We conclude this section by giving reference parameters that have been used to run simula- 
tions and to study the close analytic formulas of the Four-Stage Model for different probability 
distributions and parameter choices. 
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The speed of polypeptide chain elongation changes with respect to the growth regime, see 
table [3.2] We chose then the value of 18aa/s as average protein elongation speed. 

The average protein length is of about 300 amino acids, see [7], and of about 360 amino acids 
for E. Coli, using table [46]. We choose then 300 as reference value for simulations. Therefore, if 
we took the reference value of 18 aa/s as average protein elongation speed, the time for protein 
elongation is of about 16 seconds. 

The average messenger half life considered in simulation is Tp; = 120 seconds. 

All these parameters are resumed in table [3.3] Here are considered parameters for different 
growth regimes, which have been taken to be 20, 60 and 120 minutes as cell doubling time. 


Chapter 4 


Multi-protein model 


In the previous two chapters we have investigated the production of a specific protein, focusing 
on the characterization of the fluctuations of a single protein, which is, to our knowledge, the 
case of all the mathematical models of gene expression proposed up to now. When several classes 
of proteins are considered, we have to take into account the sharing of resources, since each class 
requires a fraction of the common and limited resources of the cell. Therefore, the allocation of 
the resources within the cell has to be understood in order to improve our knowledge of the gene 
expression. Large scale experiments have been performed in recent years in order to analyze 
the fluctuations of many different types of proteins within organisms [2] [48], but these global 
experimental results have not been accompanied by stochastic models considering the production 
of many types of proteins. In this chapter, a first step in this direction is done by investigating 
the allocation of a limited number of ribosomes in order to produce different types of proteins 
by means of a Markovian representation. 

In particular, asymptotic results for equilibrium and transient behavior are obtained through 
a scaling procedure, under the reasonable biological assumption of saturation, i.e. the resources 
of the cell are limited. It is shown in particular that the number of free ribosomes, i.e. the 
ribosomes which are not elongating any polypeptide chain, converges in distribution to a Poisson 
distribution whose parameter satisfies a fixed point equation. 


Plan of the Chapter. In Section [4.1] we describe the mathematical model of production of 
multiple-proteins and we derive the equilibrium results. Section [4.2] is devoted to the derivation 
of the asymptotic behavior of equilibrium and of the probability distribution of free ribosomes in 
the limiting regime. In Section [4.3] we analyze the underloaded and overloaded case for ribosome 
charge and analyze the resulting distributions of free and active ribosomes. 


Introduction 


The characterization of the protein production have focused in the previous chapters on the 
description of the main steps which lead the genetic information to be transformed into a func- 
tional product for a single protein. However, as long as we look at the whole picture and we 
consider the protein production in its globality, interactions between different protein production 
chains take place and this significantly complexifies the resulting scenario. Many studies have 
focused on the gene regulation pathways, studying the mutual interactions of a set of proteins. 
These models, mainly deterministic, are characterized by the introduction of feedbacks. Here we 
tackle the problem from a different point of view. More precisely, experimental studies focusing 
on exploration of noise have pointed out how part of the resulting noise is not protein specific 


89 


90 CHAPTER 4. MULTI-PROTEIN MODEL 


but seems to have a global effect on all proteins under analysis. This global effect has been 
associated to fluctuations in cellular components such as polymerases, metabolites or ribosomes, 
as suggested by Elowitz et al.[16] [69]. Therefore, we focus here on the impact of limited global 
resources, ribosomes, on the production of different types of proteins. 


Competition for cell resources 


Ribosomes are a crucial player in the expression of all genes and turn out to be extremely 
expensive in terms of resources. For this reason cells seem to strictly regulate their number in 
order to optimize the expression of genes and do not waste resources. We will obtain results on 
the resulting fluctuations in protein production in both cases of under and over-exploitation of 
ribosomes, the underloaded and overloaded cases in the following. 

To each protein type is associated a specific gene and a specific mRNA type. Nevertheless, 
there are agents, polymerases and ribosomes, in charge of steps of gene expression, which are 
common to all protein types. These macromolecules can be seen as common resources in the 
protein production at cellular level and, because of their complexity, the cost for their production 
is quite high in term of resources. For this reason the number of polymerases and ribosomes is 
kept limited and strictly controlled. Protein production can thus be seen, in first approximation, 
as the result of competition between genes (resp. mRNAs) to associate with polymerases (resp. 
ribosomes). Most of the mathematical models of the literature consider exclusively the produc- 
tion of a given protein, without taking into account the limited resources due to competition 
with the simultaneous production of other types of proteins. 

In this chapter, we consider a first step in considering the competition for resources resulting 
from the simultaneous production of many protein types. In particular, we analyze a stochastic 
model of the competition for ribosomes, which will act as limiting factor. We assume that there 
are P classes of proteins, each class being characterized to have fixed concentrations A4, ..., Ap. 
The total number N of ribosomes in the cell is assumed to be constant. The ribosomes which are 
not bound to any mRNA are referred to as free ribosomes, while the ribosomes which are attached 
on messengers are referred to as active ribosomes. To each class of proteins of concentration Ap, 
for 1 < p < P, are associated K, messengers. Each messenger associated to a protein of class 
À, is said of being of class p and can accomodate a maximum of Cp ribosomes. Each ribosome 
bound to a messenger of class p elongates a new polypeptide chain at rate 4p and is then released, 
thus entering in the class of free ribosomes. To stick with biological evidence, we assume that a 
saturation condition holds, i.e. the total number N of ribosomes is significantly smaller than the 
number of ribosomes that all mRNAs would bind to at equilibrium in an unconstrained system. 

If we denote with y, the number of active ribosomes bound to a messengers of class p, then 
the total number of free ribosomes is given by N — y; —-:-— yp. A free ribosome binds at 
rate A, to a messenger of the class p with less than Cp ribosomes already attached (recall that 
any messenger of class p can accommodate at most C, ribosomes). The rate À, is an effective 
parameter including different physical characteristics, such as the affinity of a mRNA of class p 
with ribosomes and is therefore crucial in the description of the competition for free ribosomes. 
The analysis is complicated, since the vector (yp) of active ribosomes varies over time because 
of the bindings of free ribosomes and, once the production of a protein is completed, to the 
unbinding of active ribosomes. 

The main simplification in the model presented in this chapter, is the assumption that the 
number of mRNAs associated to each class is fixed and we adopt the exponential assumption 
for each step throughout this chapter. The model does not consider the decay of proteins, 
since it focus on the dynamics of ribosomes. These simplifications have been necessary in order 
to obtain a first description of the competition for common resources and to move towards a 
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systemic description of the protein production. 


4.1 Stochastic model 


We introduce now the model of production of proteins and the associated notations. The total 
number N of ribosomes in the cell is supposed to be constant. We denote with A4, ..., Ap, the 
P distinct values of concentration of proteins, for each concentration A, being associated Kp 
different types of proteins, for 1 < p < P. For any 1 < p < P and 1 < k < Kp, the corresponding 
protein (resp. mRNA) is referred to as being of class (p, k). The rate at which a ribosome binds 
on an mRNA of class (p, k) is denoted with Ap, in particular this parameter is assumed to depend 
exclusively on protein concentration. Similarly, once a ribosome is bound to a mRNA of class 
(p, k), it produces a new protein at rate up. We define the load p as 
def Ap 
Pp Llp , 
where 1 < p < P. An mRNA of class (p, k), being of finite length, can accommodate a maximum 
of C ribosomes, where this threshold depends only on the respective protein concentration. 
Let XP), for 1 < p < P and 1 < k < Kp, be the number of ribosomes attached on the 
messenger of class (p, k). In particular, the quantity 


Kp 


N 
N-X S Xpalt) 


p=1 


> 
Il 
= 


is therefore the number of free ribosomes at time t, i.e. the ribosomes which are not attached on 
any messenger. 

Remark. Note that in the present model we suppose that there is a fixed number of mRNAs 
for each class of proteins, each messenger having Cp slots for ribosomes. Each class of pro- 
tein of concentration Ap contains only one type of protein, i.e. the product of a specific gene. 
In order to consider the case L different types of proteins, it sufficed to consider L classes 
(p, mı), .--, (p, ML), hence the assumption of one type of protein per class is not restrictive. 


State space and Q-matrix 


Since we have supposed that the duration of each event is exponentially distributed, then the 
protein production is Markovian. More in detail, the Markovian process 


(X(t) = {(Xpa(t)) :1<p<P1<k<K;,} 


describes the protein production and is an irreducible Markov process with values in the state 
space 
P Kp 
S= À (typ, LEE) : ope €N,0< ape <Cp and S me SN 
p=1l k=1 
Because of our assumptions, the Q-matrix Q = {q(x, y) : x,y € S} is given by, for 1 < k < K 
and x = (tpn) € S, 
nee = rx), ftp +1< Cp, (4.1.1) 


q(x, £ =. €p,k) = Hp, if Lp > 0, 
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where 


P Kp 
a =N- D > tpt 
p=1k=1 
is the total number of ribosomes and e,, x is the unit vector associated to the coordinate (p, k), 
i.e. 


PB k) = ns = 


z 1 ifp=pandk=k 
0 otherwise. 


This class of models is related to stochastic models of communication networks like loss 
networks. See Kelly for example. The difference here is that the analogue of the “input 
rate”, i.e. the rate at which free ribosomes binds to mRNAs, depends on the state of the process 
through the variable r(x). Another class of related models are the Gordon-Newel networks where 
a fixed number of customers/ribosomes travel through the different nodes of the network with 
some routing mechanism. See Gordon and Newel [23]. 


Invariant distribution 


We derive now few results concerning the invariant distribution. 


Proposition 4.1.1 (Stationary distribution). The Markov process (X (t)) is reversible and its 
invariant distribution m is given, for x = (x,») € S, by 


1 1 LR 
n(x) = ZrO llle (4.1.2) 
p=l1k=1 
where, for 1 = p S P, Pp _ Ap/ Lp; 
P Kp 
r(x) =N- 5 Tp,k 
p=1k=1 


is the number of free ribosomes and Z is the normalization constant. 
Remark. Note that, when (X(t)) is in state x = (x,,4), the quantity 

Hp(Lgep1>0} + lep 2>0} Ft Lop >0}) 
is the total rate of production of proteins with concentration Ap. 


Proof. One has only to check that, see [84] Theorem 1.3, page 6], for any 1 < p < P and 
1 < k < Kp, the relation 


m(x)q(x,2 + ep x) = rx + e,r)g(r + ep x, 2), 
holds, i.e. that 
rx) Apr (x) = (x + ep k)Up 
holds for any x € S such that x + ep, € S. In fact, the previous relation is equivalent to 


P Kp 
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Now, 
P Kp = 
rte) =N- >> (xg “her ñ)) = r(x) — 1, 
p=1k=1 
and 
P K 2 P K, P K 
Ts prep, k( Pik ke +1 T= Ts k 
ME = TT IT" = e L 5”" 
P=1 k=1 p=1 k=1 P=1 k=1 
PEP k£k 


and the proposition is proved. 


A direct consequence of the previous proposition is the following proposition, which expresses 
the invariant distribution in terms of geometric random variables. This representation will be 
useful to derive asymptotic results in Section 


Proposition 4.1.2. If, fori < p< P and1 < k< Ky, (Gr) is an i.i.d. sequence of geometric 
random variables with parameter pp on the state space {0,...,Cp}, then the distribution of the 
number R of free ribosomes at equilibrium is given by, for0<n<N, 


Kp 


P 
11 
P(R=n) = za XS Gx=N-n|, (4.1.3) 
` p=1 k=1 


where Zp is the convenient normalization constant. 


Proof. Note that 


=P |(Xpk = Tpk 1 SPL P,1 <k < Kp), X tpk=N-n 


pk 
Now if G;,8 is a geometric random variable on {0,...,C,} of parameter 1— pp, for p = 1, ..., P 
= k = 1, ..., Kp, then P [Gp = m] = ae which, together with formula (4.1.2), 
gives 
ie ee ee 
P(R=n)=—-—P Gok = N — 
( my Zrn! > m pk nl 


since r(x) =n. 


The expression of the invariant distribution seems satisfactory since it gives an explicit rep- 
resentation of the equilibrium of the system. Nevertheless, because of the large size of the state 
space S, the partition function Z turns out to be hard to estimate and, consequently, it is difficult 
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to obtain the interesting characteristics of this system. In the following section we consider a 
scaling approach for this model by letting both the number of free ribosomes N and the number 
of messengers go to infinity with specific relation. Through this scaling method, we will obtain 
a qualitative description of the system and, in particular, the rate of production of proteins. 


4.2 Asymptotic Behavior of Equilibrium 


From now on, we assume that the number N of ribosomes is large and that the number of 
messengers Kp associated to class p of proteins with a fixed concentration Ap, 1 < p < P, scales 
with N in the following way, 


KN = NB, + (VN), (4.2.1) 


in particular the sequence (KY /N) converges to 8p. A superscript N is added to the various 
quantities related to the production of proteins when the cell contains N ribosomes. Note that 
forl<p<Pand1<k< KY, the number of ribosomes that can be attached to an mRNA of 
class (p, k) is still bounded by the quantity Cp. 


Saturation Regime 


Proposition shows that, at equilibrium, given that the number of free ribosomes is n, the 
state of the process (X, £(t)) has the same distribution as (Gp) where Gp, is a geometric 
random variable with parameter p, restricted to the state space {0,1,...,Cp}, 


-p 
P(Gp,k = m) = Pp aiT 
1 — pp 


0< m< Cp, 


conditioned on the event 
KN 
P P 


An = X S Gor = Nan 


p=1 k=1 


Since ribosomes are costly macromolecules, their number is kept as small as possible provided the 
performance of the cell is maximum. In fact, the total number of ribosomes is the result of the 
trade-off between the efficiency of translation, obtained with a larger number of ribosomes, and 
the impact in terms of resources which derives from the production of ribosomes themselves. For 
this reason, a typical biological assumption is that the total number of ribosomes is much smaller 
than the mean number of places available on mRNAs. Therefore, it is reasonable to assume that 
the average number of ribosomes that would be active on the messengers in an unconstrained 
case is bigger than the total number of ribosomes N, i.e. 


P 
SK E(Gp1) >N 
p=1 
holds, i.e. 
P Cp+1 , Op: 
KN CpPp (Cp t 1) pp +1 N 
ss p Pp DFI > 
p=1 (1 — pp)(1— pr" ) 


so that the probability of event 4, is small for all n > 1. 
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This saturation condition will hold for N sufficiently large if the following relation is satisfied 


P Cp+1 C 

Cppp® — (C. 1)p,° +1 
X BpPp pPp ( EU + > 1, (4.2.2) 
p=1 (1 — pp)(1 — pp” ) 


from relation (4.2.1). 


Note 4.2.1. Note that, if we consider the unconstrained case for the a available slots on mRNAs 
for ribosome bindings, i.e. we remove the constraint that each mRNA of class p can bind to at 
most Cp ribosomes, then, with similar arguments that lead to relation (4.2.2), we obtained the 
saturation condition 


D ee À (4.2.3) 


In the case p <1, then the previous condition is satisfied whenever (4.2.2) holds. 
One derives the asymptotic distribution of the number of free ribosomes at equilibrium in 
this limiting regime. 


Theorem 4.2.2 (Free ribosomes limiting regime). Under the limiting regime with the 
saturation condition (222), as N goes to infinity, at equilibrium the number of free ribosomes 
converges in distribution to a Poisson random variable with parameter y(C) where y(C) is the 
unique non-negative solution y of the equation 


1 


P Cp+l, C,+1 Cp, C 

C P Gel pt] 
J BpPpY php # (Cp )Pp"Y = 1, (4.2.4) 
pal 


(1 — ppy)(1 — pp? + yCrt1) 
C = (Cp) and pp = àp/ Hp- 


The proof of Theorem is given below. 
We first introduce a new representation allowing to rewrite the number of free ribosomes and 
few connected results. For N > 1, define Sy as 


N 
P Kp 


SN=N-S Y Gpr, 


p=1 k=1 


where Gp, ~ Geo(pp), with space state restricted to {0,1,...,Cp}. 
From Equation (4.1.3) of Proposition |4.1.2| the distribution of Ry the number of free ribo- 


somes can be expressed as 


P(Ry =n) = >——P(Sy =n), (4.2.5) 


where Zr, is the normalization constant. 

For a fixed n > 0, we first derive the asymptotic behavior of the sequence (P(Syj = n)) by 
using a change of probability distribution similar to the one used in the proof of Bahadur-Rao’s 
Theorem, see Dembo and Zeitouni [13]. A strengthened version of the local central limit theorem 
is also an important ingredient. This can be seen in fact as the probabilistic analogue of a saddle 
point method applied to the series 


Kp 


P 
S Mie 
1 


2=(2p,R)ES P=1k= 
r(x)=n 
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see Flajolet and Sedgewick [17] chapter VIII] for example. 

The advantage of the probabilistic formulation is that the technical framework is significantly 
lighter, due to use of a powerful central limit theorem in fact. 

For @ > 0, define S$, the random variable whose distribution is defined by, for a non-negative 


function f on N, ; 
E(f(Sy)) = an) 


with dv (0) = E(exp(6Sy)). By taking for the function f the generating function, i.e. f(n) = e”” 
for 7 > 0, we get that 9e can be represented as a sum of independent random variables 


E (f (Sn) e**), (4.2.6) 


P Ko 
SK=N-Y Y Gr, (4.2.7) 
p=1 k=1 
where, for 1 < p < P, (G k k > 0) is a sequence of geometric random variables on {0,1,...,C,} 


with parameter poe”. In fact, since the moment generating function of the r.v. Gp,k is 


1—pp 1- (Cea ae 
1—pprt l= ppe™ 


7 


Berra] = 


and since G, 4 are i.i.d., then 


N 0 
J [e"s* = 1 R ferry] = en y Le penes] 
n (0) on (0) a 
= eN II L= pac? 1— (pype (1+9))Crt1 
pi 1— (pe) OH 1 — ppe D) 


N 
which is the moment generating function of the random variable N — = = G9, = 
From (4.2.7), with f(n) = Lyyznye 7, for n > 0, we obtain 


p _ 
sh Oa [Tessa : 


F, [hise >ne 


therefore 


P [Sv = n] =E [Lisw>n)] = on (8) E (rasg ™) 


+00 
on (0) f Pe UP (n < S$; < u) du. (4.2.8) 


In fact, let X denote a positive random variable with p.d.f. fx and let C > 0, then 


u=C C u 
n. f de ™ ( f Linsey fx (6) ae) du 


u=C 


c 
f Le "fx (u) du = g(u) 


C 
+f de "Pp [n < X <uldu, 
0 


where g(u) = e~% i. Line} fx (€) dé. Now, since g(0) = 0 and lim, +. g(u) = 0, by letting 
C — +00 we obtain the relation (4.2.8). 
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From relation (4.2.8), the asymptotic behavior of (P(Sw > n)) can be obtained through limit 
results concerning $%,. We will show in Lemma that 0 can be chosen so that the average 


of S$, is 0 and then, due to the representation in terms of independent variables, a central limit 
theorem gives the desired asymptotics. 


Lemma 4.2.3. For any N > 1, there exists a unique Oy such that (SN) = 0 and the se- 
quence (On) converges to —logy(C), where y(C) is defined by Equation (4.2.4), furthermore 
E((S¢*)?) = o? N + o(N), where 


FC (Cy — 1)y2 — 2(Cp + 1)(Cp — yp + Cp(Cp + 1)) — 22 


(1 — yp)? — yp? **) 


P ys” 
P 
o= |) Bp 
p=1 


with, for 1 < p < P, Yp = ppy(C). 


Proof. The function ++ ġy (0) = E [e?S^] is strictly convex with y (0) = 1 and 


N 
OEI =N- EKE LG] =N (1- et.) 


P 


P Cp+1 —0(C,+1) Cp —0C 

_o CpP eet!) — (Cp + 1)op e "r +1 

=N (1- > Boppe PE = D z ree to(VN). 
p=1 (1 — ppe®)(1 — pp” € pt1) 


The saturation condition (4.2.2) shows that, for N sufficiently large, ¢'y(0) < 0. Now, since 
ġn(0) = 1 and limy_ +o On (0) = limp 44.00 NE | e7? Epe Gr 


On such that # (8x) = 0, where the subscript stresses the dependence over N, such that ¢' (0N) = 
E [Swe?vSx] = 0. Therefore, from (4.2.6) with f the identity, E SÎN) = 0, which gives 


= +00, there exists a unique 


Copp? e-9x (Cr+) = (Cp J 1) pp? e~ 9NCe +1 _ 


(1 — ppe®N )(1 — pp? ten (Cr+) 


0, 


P 
(S$) =N = D KE pes 
p=1l 


hence, with the above expansion, one gets 


). 


P = Chpt le-0n(Cr+1) — (Cp + 1) pp? e~ ONC +1 1 

1- >. BpPpe —0 Cpt! On (Cp+1) = CF 
p=1 (l= ppe™’N)(1 — pp” es PR hy) 

The previous result shows that, for N sufficiently large, exp(—Oy) is in any arbitrary small 

neighborhood of y(C). This gives the desired convergence of (Oy). The other expansion is 

obtained with direct, straightforward, calculations. 


Proof of Theorem[.2.4 From the expansions related to the central limit theorem, see Gnedenko 
and Kolmogorov Theorem 1, page 213], one gets 


P Sy < x : a en’ /? dy 
oVN ` V2 Joo 


sup 
zER 
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where Q(x) = a(1 — x?) for some fixed constant a > 0 and L is the periodic function, L(x) = 
[x] — z + 1/2. 

By plugging this estimation in the integral of the right hand side of Equation with 
0 = On, one gets that 


+00 
1 Ove ON“ P (n < Ss. < u) du 
0 


7 Î S / er o : 
= — e” /* duĝye °X" duto (=) 
27 a n/(oVN) VN 
1 +00 f aie 2N) 6 1 ) 
=> ETT NET dune "^" du +o | —— 
l : y N 
1 -ð 1 ) 
= e N" +0 ; 4.2.9 
OnoV2nN (5 oo) 


and this estimation is uniform with respect to n > 0. 
Now, by using equation (4.2.8) and the previous estimation we obtain 


e oN” 1 —Onn o(1/VN) 
P[Sn > n] dnoVanN +0(Fy) oe 


_ = 1/ VN 
P [Sn > 0] 1 +o( 1 ) 1 + 20/0) 
OnoV20N VN 1/VN 
which gives, therefore, 
š P SN Pa n) —nOn 
vito UP |P(Sy>0)° |” 
and, consequently, 
: P(Sy =n) —nOn —O0n\| _ 
ES FG =O) 8 OS =O 


If we plug this uniform asymptotic result into the expression (4.2.5) of the distribution of the 
number of free ribosomes, given that the normalization constant Zr, is 


5 aP(Sw =k), 
k 


we obtain 

Te 
n! P 0 

P [Rn = n] = 1 TSn A à 
2 ken À PISw S01 


The limit result of Lemma concludes the proof of theorem. 


The fixed point equation has a simple intuitive explanation. First one notices that the 
number of free ribosomes evolves on a very rapid time scale of the order of N, since the number of 
attached ribosomes and the number of free places for ribosomes on mRNAs are of this order. In 
fact, because of equation we have that the total number of available places for ribosomes 
grows as N and, from Theorem [4.2.2] since the free ribosomes are Poisson distributed, we derive 
that the total number of attached ribosomes is of this order as well. The number of free ribosomes 
is therefore quickly at equilibrium, let y(C) be its average. As an approximation, for a given 
class (p, k) of proteins, provided it is less than C, and positive, the number of ribosomes attached 


4.2. ASYMPTOTIC BEHAVIOR 99 


to an mRNA increases by 1 at rate y(C)A, and decreases by 1 at rate up. At equilibrium, one 
gets that the average number of ribosomes attached is 


Cppp? (C)+ — (Cp + 1)pp?y(C)@ +1 
(1 — P(O) — pp? y(C)Cr #1) 


since, by supposing that the free ribosomes have reached equilibrium, the resulting process 


is a birth-death process and, therefore, the stationary distribution is geometric with parameter 

At) = ppy(C), where the space is restricted to {0, 1, ..., Cp}. Therefore, the previous formula 
P 

corresponds to the average value of a geometric random variable Gs, p of parameter ppy(C) with 

values in {0,1, ..., Cp}- 


As consequence, the average total number of attached ribosomes is given by 


I(Xp,x) = Pp (C) 


P 
XO KN E(Xpx)- 
p=1l 


This quantity must be of the order of N (few free ribosomes) which gives the desired fixed point 
equation for 7(C). 


We now look at more detailed characteristics of the protein production process, namely the 
number of ribosomes attached to mRNAs of class (p, k), with 1 < p< P and 1 < k < KÄ, 
at equilibrium. Note that, since all the proteins with specific concentration have the same 
parameters, for 1 < p < P at equilibrium the random variables Xp, for any 1 < k < KY have 
the same distribution. The following theorem gives a limit result for the distribution of these 
variables as N gets large. 


Theorem 4.2.4. Under the limiting regime with the saturation condition (4.2.2), as N goes to 
infinity, at equilibrium 


lim (X7 


NE prl <p<P)= (Yp, 1<p< P) 


for the convergence in distribution, where Yp, 1 < p < P are independent random variables and 
Yp has a truncated geometric distribution on {0,1, ..., Cp} with parameter ppy(C) where y(C) 
is the unique non-negative solution of Equation (4.2.4). 


Proof. The notations of the previous proof will be used. From Equation (4.1.2) for the equilibrium 
distribution, we obtain, for any non-negative function f on NP, 


E (f (Xfi 1 < p< P)) = 5- 5 —E (f (Gas: Gp,1) Np (a, nt); (4.2.10) 


where, for 1 <p<Pand1<k< KN , Gp,k is a geometric random variable with parameter pp 
restricted to the state space {0,1,...,C,}, and Zy is the normalization constant. If (mp) € NP, 
we have that 


P [(Gpa) = (amp). 1((Gpa)) = N] = bn ONE (iroyin (ro) ent À ) 


where, as in the proof of Theorem (Gy) are geometric random variables with parameter 
Pp exp(—On). 
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As a consequence we obtain 


On SON 


= Ty (G9x )=(1mp) r((G2X)) =n} 


=P (Yp) = (ME (Len, -nthe Er ll), 


where ||m|| = mı +---+ mp and the bar S and G is a short notation to specify that all the 
components with an ae k = lin SÎN and (Ge ~,) are removed (recall that Sn = N—5:,, p GSN): 
In fact, | 


a oon iE 

| EL (GEN) =(np) (GEN) =n} © 

= P[(Gp%) =(m)] 8 E {r((apy))=n} es 
= r[(GX) = m)]E es 


=) =6 
where the last equality comes from the fact that r (Gx) =n + ||m]| and Sf” = Sy" — Iml]. 
From Estimation (4.2.9), one gets that 


5 1 1 
7 —On(Sn—||ml| | — —Onn(4 _ pN eau 
) CRT ) - dno Jia {—-e")+o (=) , (4.2.11) 
holds uniformly in n and m. 
By plugging relation (4.2.11) in equation (4.2.10) with f(x) = 1g(2,)=(m,)}, We obtain 


im PAM) = (mp) = P (Yp) = (mp) (4.2.12 


Goa) = (m) 


The theorem is proved. 


The above theorem is actually a mean-field result. In the limit any finite subset of components 
of (X A g) converges in distribution to independent random variables. 


4.3 Analysis of fixed point equation 


We focus now on the possible biological r of the Theorem [4.2.4] The key parameter 
(€) is the solution of the fixed point equation (4 , which is characrerized by complicated 
expressions involving vectors of the loads (pp) Mri = the capacities (C,). A simple and tight 
approximation of y(C) is now presented. In particular, it is shown that, for a wide range of 
values of the p,’s, the solution 7(C) actually depends on the vector of capacities C = (Cp) in a 
simple way . 

The main ingredient of the estimations for y(C) is the simple following fact. If G is a 
random variable whose distribution is a truncated geometric distribution with parameter p on 
{0,1,...,C}, if p < 1 then, provided that C is sufficiently large, G is close in distribution to 
a plain geometric distribution with parameter p and, in particular, its expected value is close 
to p/(1 — p). In the case p > 1, then G can be expressed as G = C — G, where G is a 
truncated geometric distribution with parameter 1/p < 1. Explicit bounds on the accuracy of 
the approximations of 7(C) are also provided. 
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4.3.1 The underloaded case 


In this section, without loss of generality, we assume that pp < 1, for all 1 < p < P. The main 
result of the following Lemma is the fact that the quantity y(C) is independent of C. 


Lemma 4.3.1. Under the conditions pp < 1, for all 1 <p < P, and the saturation condition 
(4.2.2) if yo is the unique solution of the equation 


__PpV0 _ 
3 B =], 4.3.1 
Le Pp ( ) 


then the solution y(C) of the fixed point Equation (4.2.4) is such that yo < y(C) and 


1 CD 
MC) - ol < <p n g (Cr + Dee” Jee (4.3.2) 
bi BpPp p=1 1— pp 


with C = (Cp). 


Proof. For x > 0, define the implicit function ¢(a) as the unique solution of the equation 
Beer ee = 2, (4.3.3) 


then yo is simply (1). Note that, since we restrict to x > 0, then ọ(x) > 0. In fact, if for 
some ğ > 0 we had ¢(%) < 0, then we would obtain 0 > D- j PotD = = £ > 0, which proves 
our statement. Moreover, we have that @(x) < 1/p, for all 1 < p < P, since, as long as ¢(z) 
approaches 1/9, for any p, then the left had side of diverges to +00. In summary, for any 
x > 0, we have a unique solution @(x) of equation (4.3.3), such that 0 < @(x) < 1/pp 

It is easy to show that, by Implicit Function Theorem, the function ¢ is differentiable and 


Pr Pp 
(1 — $(@) pp)? 


S 
Mr 
| 


(4.3.4) 


p=1 
We have, in particular, that 
0 < d(x) < 1/(B1p1 + Bapa +--- + Bppp). 


In fact, since ¢(x) > 0, then 1 — p(x)p, < 1 and, therefore ¢'(x) es BpPp < 1. The other 
inequality is straightforward from the formula (4.3.4). 
For 0 < y < 1 and c > 0, the relation 


c+1 c 
ey {et Dyed ee oe | 
Ie =1-(c+1)y FE (4.3.5) 


holds. 
Since (4.2.2) holds, then the existence of y(C) is assured. Equation (4.2.4) can then be 


rewritten as 
(Ppy(C)) t! 
1— (P(C) +1 }” 


P 
C) = (: + 5 Bp(Cp + 1) 


p=1 
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by using relation (4.3.5) together with the definition (4.3.3). This proves, in particular, that 
yo < 7(C). Since yo = (1) and the previous relation, we obtain by the mean value theorem 


joer 


ne) pls SA Cp +1); DE JOH Eher 
py 


=l 


and therefore Relation (4.3.2), since 0 < (xz) < 1/pp, for any x > 0 and for all 1 < p < P. 


The above lemma states that if the right hand side of Relation is sufficiently small 
then the simple constant yo is an accurate estimation of the parameter y(C). Since the pp’s are 
< 1, this approximation will hold in the case where the values of the Cp’s are not too small which 
is a reasonable biological assumption. One gets therefore the following proposition. 


Proposition 4.3.2 (System with only Underloaded Classes of mRNAs). Under the conditions 
Pp < 1, for all 1 < p < P, and the saturation condition (4.2.2), then, under the limiting 


regime (4.2.1), at equilibrium as N goes to infinity, 
1. the distribution of the number of free ribosomes is Poisson with parameter yo + o(h(C)), 


2. the distribution of the number of ribosomes attached to an mRNA of class (p, k) is geometric 
with parameter ppyo + o(R(C)). 


where yo is the unique solution of the equation 


pi — PpV0 
and 
P Cp+1 
(Cp +1) pp” 
No) =y g GE 
p=1l P 


As a consequence, under the assumptions of the above proposition, the production rate of 
proteins of class (p, k) is given by 


Apo 
=> +o(h(C)). 
r toO) 


4.3.2 The Case of Overloaded mRNAs 


The case where at least one of the p, is strictly greater than 1 is now investigated. Without loss 
of generality, we assume the ordering pı < p2 < +: < pp and we suppose that pp > 1 holds. Let 


L = sup{p > 1 : pp < 1}, 


with the convention that sup{} = 0. For simplicity, it is assumed that none of pp is equal to 1. 

With similar arguments then in Section concerning the truncated geometric random 
variables which describes the system at equilibrium, if p, < 1, then the truncation of the geo- 
metric random variable G at Cp has little impact in the analytic expressions with respect to a 
plain geometric random variable. If conversely p, > 1, then we may express the random variable 
G as Cy — G, where G has a truncated geometric distribution with parameter 1/p,. 
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The saturation condition can now be written as 


L 
2 bT 
p=1 


~ 3 B (cp - —) + (€) > 1, (4.3.6) 


p=L+1 


where 


(CRD Cp+1 
--X Br! sore ees cn 
E p=L+1 
The following proposition gives a simplified version of the results of + this case. 


Proposition 4.3.3 (System with Overloaded Classes of mRNAs). If, for some 1 < L < P, one 
has pı < p2 S++- pr < 1 < pr4i < pp, then if 


Dart n > CE =) >: (4.3.7) 


p=L+1 


then, under the limiting regime (4.2.1), at equilibrium as N goes to infinity, 
1. the distribution of the number of free ribosomes is Poisson with parameter yo + o(h(C)). 


2. For 1 < p< P and k > 1 the number Xp, of ribosomes attached to an mRNA of class 
(p,k) is such that 


e ifp<L 
Xp,k is a geometric r.v. with parameter ppyo + o(h(C)); 
e ifL+1<p<P 
Cp — Xp, is a geometric r.v. with parameter Yo/Ppp + o(h(C)), 


where 1/pL+1 < Yo < 1 is the unique solution of the equation 


Da He + S CE —) =1 (4.3.8) 


Poo pT Ppyo — 1 


and 


(Cp + Dp “a Cp+1 
Der a De ot 
> p=L+1 
Proof. The existence of a solution 1/pp < yo < 1 to Equation (4.3.8) is a consequence of Con- 
dition (4.3.7) and therefore that yopp > 1 for L +1 < p < P. The rest of the proof follows the 
same arguments as in the proof of the above lemma. 


Coming back to the proof of Theorem where a change of probability was used, for 
1 < p < P, a geometric random variable with parameter pp is transformed with a reduced 
parameter y(C)pp, it should be noted that this change of distribution does not affect the set of 
overloaded classes under the assumptions of the above proposition, i.e. the 1 < p < P such that 


V(C)pp > 1. 


Appendix A 


Mathematical tools 


A.0.3 Marked Poisson Point Processes 


The main results concerning marked Poisson point processes (MPPP) are briefly recalled in 
the present section. For more details we refer to Kingman [38] and Chapter 1 of Robert [65]. 
Throughout this section H is the space R? for some d > 1. 


Definition A.0.4. If À > 0, (dx) is a probability distribution on H, a marked Poisson process 
on Ry x H with intensity Adu® (dz) is a sequence Ny = (tn, Xn) of elements of Ry x H where 


e (tn) is a (classical) Poisson process on Ry with rate A. 
e (Xn) is an i.i.d. sequence with values in H and whose distribution is (dx). 


The sequence M can also be seen as a marked point process on R4 x H, i.e. if f : R} x H > 
R+ is a continuous function then 


MN = | Plt BN (du, de) = 3° ftn Xn). 


In other words Ny can also be seen as a sum of Dirac masses at the points (tn, Xn). 


Laplace’s transform of MPPP 
The following important proposition characterizes marked Poisson point processes. 


Proposition A.0.5. The point process Ny = (tn, Xn) is a marked Poisson point process with 
intensity À du ® p(dx) if and only if the relation 


[exp (-NA(f))] = exp (-> f = (1 z gra) du nas) (A.0.1) 


holds for any non-negative continuous function f on Ry x H. 


The left-hand-side of Equation (A.0.1) is usually defined as the Laplace transform of Ny at 
f. This quantity determines completely the distribution of any marked point process. 


For € > 0, by replacing f by ¿f in Relation (A.0.1), one gets an expression for 
E [exp (-EN)(f))] - (A.0.2) 
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If one differentiates formula (A.0.2) with respect to € and sets € = 0, from (A.0.1) we obtain 


LIN) (f)] =E pl _ fede da) = À a cn fh) du u(dx). (A.0.3) 


MPPP and Martingales 


Be M a Poisson marked point process, i.e. My is a Poisson point process on R x H with intensity 
measure À dx @ v(dy), where v(dy) is a probability measure on H, locally compact metric space. 
Then we denote with F; the o-field generated by the random variables A x U, where A is a 
Borelian subset of (co, t] and U is a Borelian subset of H, i.e. 


Fi = 0 (Ny((co, t] x U):a,b <t,U € B(H).) 
We have the following result, see Robert [65] Proposition A.9, page 353]. 


Theorem A.0.6. If X(t,x) is a cádlág process adapted to filtration F, measurable and square 
integrable with respect to to v(dx), then the process 


ott) = f| Xey) Watas dy) — Ads (ay) (A.0.4) 
(0,t]x H 
is a square integrable martingale, whose increasing process is 


I(t) = // X?(s,y)A dsv(dy). (A.0.5) 
( 


0,t]x H 


Appendix B 


Biology 


B.1 Biological Mechanisms 


B.1.1 Gene activation 


Gene activation is the process which allows a gene to be expressed at a specific time. The way 
this activation may occur varies a lot from gene to gene and from organism to organism. The 
main mechanisms causing gene activation are the dissociation of a repressor, the association of 
an activator and the chromatin remodeling. 


Repressor 


A repressor is a DNA-binding protein that regulates the expression of a specific gene by binding 
the operator, which is a segment of DNA that a regulator binds to. The binding of the repressor 
blocks the attachment of RNA polymerase to the promoter and prevents the transcription of the 
genes. If an inducer molecule is present, it can interact with the repressor and inhibit its action 
by detaching it or preventing its binding to the operator. 


Activator 


An activator is a DNA-binding protein that regulates one or more genes by increasing the 
transcription rate. RNA polymerase binds to the promoter region of the gene, forming a complex 
which sometimes proceeds to gene transcription. An activator recruits the RNA polymerase to 
its promoter region. 


Chromatin remodeling 


The chromatin remodeling is another way to regulate the expression of a specific gene in eukary- 
otes. Chromatin is the complex of DNA and histone proteins with which it associates. Histones 
are highly alkaline proteins found in eukaryotic cell nuclei that package and order the DNA into 
structural units called nucleosomes. Chromatin on one side serves as a way to condense DNA 
within the cellular nucleus and, on the other side, as a control of gene expression. In eukaryotes 
specific genes are not expressed if they are not accessible by RNA polymerases and transcription 
factors. Therefore, chromatin must open to let transcription to take place. More in detail, when 
the surrounding chromatin is in an acetylated state, i.e. it is open, the gene can be transcribed, 
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while the transcription is repressed when the chromatin is in a condensed state. The process of 
“opening” the chromatin complex is called chromatin remodeling. 


At nowadays there are no direct proofs of the role of chromatin remodeling to stochastic 
activation/deactivation of gene expression in eukaryotes. Nevertheless, a number of studies have 
tried to find such a connection by studying different effects by indirect means. |Becskei et al. 


2005] [4] and 2006| [61] have studied the correlations between proximally located genes, 
focusing therefore on positional effects. |Raser and O’Shea| |2004] [62] and of 2006] [73] 


have altered, on the other side, the behavior of chromatin-remodeling agents and then analyzed 
the effects. Raser and O’Shea [62] hypothesize that chromatin remodeling is the key regulation 
mechanism for certain eukaryotic promoters. In particular, the noise strength profile of promoter 
PHOS of Saccharomyces cerevisiae is approximated by the noise profile of a three-stage model, 
introduced in Section [2.A.2] In this specific case gene activation is an infrequent process with 
respect to transcription and the active promoter is stable. The PHOS is then supposed to depend 
on a slow promoter transition. In the inactive state the PHO5 promoter exhibits positioned 
nucleosomes. Chromatin complexes contribute then to the nucleosome removal, causing gene 
activation. 


The DNA methylation is another way to control 
transcription. In this process, cytosine bases of eukary- 
otic DNA is converted in 5-methylcytosine, resulting in 
the repression of the transcription. 


We have here tried to show that the control of gene Active gene 
expression is complex and it results to be much more 
complicated when looking more in detail the mecha- 
nisms. The details vary between different genes and LA LL LL LA Te MT MT TT 
different species. Genes may show different states, but 
in first approximation we will suppose that each gene 
may show only two possible states: active or inactive. 


The number of copies of a gene within bacteria is a 
fundamental factor and should be considered in a model 
describing the expression of a specific gene. When bac- 
teria are growing, they duplicate their DNA, that leads 
to a number of at least two copies per gene, since the 


genetic information has to be split between daughter 
cells. Inactive gene 


Nevertheless bacteria like Escherichia coli are char- 
acterized by cell division cycles of 20 minutes, while the 
DNA replication requires about 40 minutes. So those 
bacteria are obliged to have more than two copies of 
DNA and have parallelized the tasks in order to increase 
the number of genes copies per cell. 


Figure B.1: Gene activation. Schema 
of gene activation/deactivation. When 
the repressor, red component, binds to 
the gene it inhibits the mRNA tran- 


scription, while the gene is activated 
So in case of “normal” growth, only one copy of a when the repressor unbinds. 


gene is present, which is doubled in about 40 minutes. 

Moreover, since the DNA replication always starts at 

same point, the origin of replication is easier to find this part of DNA with respect to the 
terminating region of replication. In “normal” growth regime the division time is smaller then 
one hour and we have regions with one, two or four copies of genes. 


In “regeneration” regime the cell division cycle takes about 20 minutes and we have up to 
eight copies of genes that are localized closer to the origin of replication. 
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B.1.2 Transcription 


The first step of gene expression is the transcription of the information of a gene into RNA. RNA 
polymerase is one of the central processes in transcription process. Bacteria contain a single type 
of polymerases, while eukaryotes contain three dinstinct types. Nevertheless, transcriptional 
mechanisms are quite similar in both species. 

Transcription begins with the binding of the RNA polymerase complex to the promoter, which 
is a particular DNA sequence at the beginning of the gene. Activation of the RNA polymerase 
complex enables transcription initiation. 

In prokaryotes transcription takes place in cytoplasm. Transcription consists of three phases: 
initiation, elongation and termination. Initiation takes place at promoter, see Section [B.2.7] for 
details, immediately upstream the gene is going to be transcribed. DNA polymerase is made by 
five subunits: 2a, 6, 6’ and o. The o subunit recognizes the “consensus sequence” in the gene 
promoter, it binds to the DNA and recruits other components of the enzymatic complex. Once 
the polymerase is bound to DNA, the o-subunit detaches and 8’ is responsible to the binding 
with the DNA. The polymerase unrolls the DNA double-helix and elongation phase starts. The 
transcription ends at terminating codon, a specific sequence downstream the transcribed gene. 
The mRNA messenger is then released in the cytoplasm due to termination protein Rho. Since 
transcription and translation both take place in cytoplasm, it may happens that translation 
starts before completion of transcription. 

The transcription process can be described through the following fundamental steps: 


1. initiation: the polymerase binds to one of the specificity factors ø to form a “holoenzyme” 
in order to attach to a specific promoter in the DNA. The more similar is a sequence to 
a “consensus sequence” the stronger is the binding to the DNA. After the first bond is 
synthesized, the RNA polymerase must clear the promoter (this phase is called promoter 
clearance). During this time it may occurs that a truncated transcript, called abortive 
initiation, is released; 


2. elongation: after the promoter clearance, the polymerase assembles in a controlled fashion 
the mRNA chain; 


3. termination: two kind of termination processes exist: the p-independent transcription 
termination or the p-dependent transcription termination. The first involves terminator 
sequences within the RNA that signals the RNA polymerase to stop. The latter uses the 
p terminator factor to stop RNA synthesis. 


In eukaryotes transcription takes place in a more complex manner. In particular, there are 
involved three types of RNA polymerases: type I, type IT and type III. These polymerases differ in 
the number and type of sub-units they are made of, as well as the class of RNAs they transcribe. 
Type I transcribes ribosomal RNAs (rRNA), type II transcribes messenger RNAs (mRNAs) and 
most of the small nuclear RNAs, type III transcribes transfer RNAs (tRNAs) and other small 
RNAs. Since RNA polymerase II transcribes protein-encoding genes, it has been deeply studied 
and many experiments and models on stochastic gene expression in eukaryotes study this type 
of polymerase. 

The transcription regulation controls the frequency and the number of produced messengers. 
The gene transcription is subject to many control mechanisms; we just recall the most common. 
The specificity factors alter the specificity of RNA polymerase for a given promoter or set of 
promoters, making it more or less likely to bind to them, i.e. sigma factors in prokaryotic 
transcription. Repressors bind to non-coding sequences on the DNA strand that are close to or 
overlapping the promoter region, impeding RNA polymerase’s progress along the strand, thus 
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impeding the expression of the gene. General transcription factors position RNA polymerase at 
the start of a protein-coding sequence and then release the polymerase to transcribe the mRNA. 
Activators encourage the expression of the gene by enhancing the interaction between RNA 
polymerase and a particular promoter. More in detail, they do this by increasing the attraction 
of RNA polymerase for the promoter through interactions with subunits of the RNA polymerase 
or, indirectly, by changing the structure of the DNA. Finally, enhancers are sites on the DNA 
helix that are bound to by activators in order to loop the DNA bringing a specific promoter to 
the initiation complex. Enhancers are much more common in eukaryotes than prokaryotes. 

In post-transcriptional phase the regulatory machine controls the number of mRNAs that are 
translated into proteins. The stability and distribution of the different transcripts are regulated 
(post-transcriptional regulation) by means of RNA binding protein (RBP) that controls the var- 
ious steps and rates of the transcripts. These proteins achieve these events thanks to a RNA 
recognition motif (RRM) that binds a specific sequence or secondary structure of the transcripts. 

The typical hypothesis in stochastic models for gene expression is to consider transcription ex- 
ponentially distributed. This can be linked to the fact that, in order to produce a new messenger, 
the polymerase must be attached to the gene. The encounter of particles is usually modeled as 
an exponentially distributed process. Nevertheless, this modelisation does not take into account 
the fact that, once polymerase is attached, several processes have to occur in order to effectively 
start the elongation of the new mRNA. In the elongation step the polymerases combine the basis 
reading the information contained into the gene, until it reaches the terminator codon and a new 
mRNA is released into the cytoplasm. We just point out that we have given a coarse-grained 
description of the transcription process not taking into account possible control mechanisms of 
the process itself and the other cellular components or other external factors which may play a 
role. 

Using our approximate description it is clear that the exponential hypothesis is a brute 
simplification of the real situation. Nevertheless, this assumption may be justified in specific cases 
and allows to make computations. In fact, it has been observed in prokaryotes that ribosomes 
attache to the growing chain of mRNA, so the exponential hypothesis is equivalent of supposing 
that the ribosome may attach to the messenger once transcription is just started. Moreover, 
mRNAs dynamics are much faster with respect to the protein dynamics. This may lead to 
consider finer modelisation for the proteins and to accept a compromise of simplicity for the 
mRNAs dynamics. 

On the other side, this assumption may not be appropriate in all cases and for all purposes. 
In fact, the binding of RNA polymerase may change the native structure of the gene, blocking 
or facilitating transcription. In the same manner the DNA replication can change the chromatin 
structure, which can remove both activators and repressors thus changing the expression rate 
of certain genes. In eukaryotes transcription and translation take place in different parts of the 
cell, the nucleus and cytoplasm respectively. Messengers have to be transported out the nuclei 
in order to be translated, which may affect the dynamics. 


B.1.3 Translation 


Translation consists of the following steps: 


1. initiation: it involves the assemblage of components ribosomal subunits (50S and 30S), 
mRNA, the first aminoacyl tRNA, GTP and initiation factors (IF1, IF2, IF3). The ri- 
bosome has three sites (A, P and E). The A site is the entry-point for aminoacyl tRNA, 
except for the first that binds directly on the P site. In the P site the peptidyl tRNA is 
formed and in the E site the uncharged tRNA detaches from the ribosome; 
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2. elongation: it is a controlled process in which the polypeptide chain is elongated with 
the addition of amino acids to the carboxyl end of the growing end. Elongation involves 
several elongation factors, a conformal change, bond formations, etc. The aminoacyl tRNA 
attaches in the A site, then moves to the P site where the polypeptide is attached to the 
growing chain and the uncharged tRNA is moved to the E site where exits from the complex; 


3. termination: it occurs when one of three terminating codons moves to the A site. These 
codons are not recognized by any tRNA but by the so called release factors. These factors 
trigger hydrolysis of the ester bond and release the newly produced protein in the cytoplasm. 
The ribosome recycling step is responsible of ribosome disassembly in such a way to be 
ready to start translation of other messengers. 


Translation is carried out by more than one ribosome simultaneously. Ribosomes are rather 
closely spaced on mRNAs, with their average distance not exceeding a few ribosomes diameters, 
see [28]. This has been confirmed by measurement on individual Æ. Coli genes such as lacZ 
[35]. Small ribosome spacing means a high frequency translation initiation (FTI), i.e. hence 
the majority of bacterial mRNAs seems endowed with FTIs that fall only slightly short of the 
maximum that would result in ribosome queueing. 


B.1.4 mRNA degradation 


The process of messenger degradation is an essential function for recycling nucleotides and for 
regulating the level of gene expression and is performed by RNase. The decay process occurs in 
short time scales, i.e. the typical half-life of a messenger is of about two minutes at 37°C in most 
cases. 

The instability of mRNAs was first discovered in the early 60’s in E. Coli and set this type 
of RNA apart from the previously known as stable RNA. It was then assumed a conservation 
of mRNA decay mechanism among different prokaryotes, but recent works [3] have showed a 
specificity in different prokaryotes. In E. Coli mRNA turnover, the major ribonucleases known 
to participate are RNase E and several 3’ — 5’ exoribonucleases (the hydrolytic enzymes RNase 
II and RNase R and the phosphorolytic enzyme PNPase). In B. Subtilis the ribonucleases 
RNase Jı and RNase Y seem to play a major role in the initiation of mRNA decay, however 
much has to be understood to clarify the primary role of these ribonucleases. 

In many bacterial organisms, the decay mechanism is unilateral, as for instance in the lacZ 
gene, which decays in a net 5’ to 3’ direction. This has been proven in other organisms thanks 
to the observation of lag. More precisely, it has been observed that the degradation started with 
few minutes lag, corresponding to the time needed to complete transcription of the specific gene. 
This evidence brings to conclude that the degradation follows a specific direction. 


B.1.5 Protein degradation 


Protein degradation is regulated and selective process within cell: proteins are continuously 
degraded to amino acids, with protein-specific rates. The degradation machinery differs between 
eukaryotes and prokaryotes and we try to give few key insight on this complex process, using the 
review article of Goldberg [21] as reference. 

Misfolded proteins can originate by mutations or post-synthetic events and can be highly 
toxic for cells. Abnormal proteins may result also from errors in transcription or translation. 
The accumulation of abnormal proteins has been shown to cause JNK kinases and apoptosis. It 
has also been hypothesized that the ability of unfolded proteins to activate cell-death program 
could be associated with neurodegenerative diseases and with certain cancers. 
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Prokaryotes have developed an elaborate proteolytic machinery to quickly destroy misfolded 
proteins. This machinery is way more complex than the proteases, which is an enzyme that 
conducts proteolysis. In fact, if such enzymes were allowed into the cytosol, “they would quickly 
convert the cell into a bag of amino acids” [21]. Evolution had to solve the problem to provide 
a machinery capable of degrade damaged or misfolded proteins without destroying essential 
constituents. 

The ability of cells to degrade misfolded proteins was found in early 70s, which has allowed 
to understand that protein structure determines not only its specific cellular function, but also 
its intracellular stability. A conformal change in hemoglobin leads their lifespan to decrease from 
about 110 days to an half-life of about 10 minutes. 

In 80s it was found that bacteria were able to degrade external proteins which are considered 
abnormal, explaining why medical proteins such as insulin cannot accumulate in E. Coli. 

Incomplete proteins, which can arise for mRNA mistakes or mutations, are quickly degraded. 
Bacteria are endowed with a mechanism which is capable to target for degradation incomplete 
polypeptide while they are still on stalled ribosomes. 

The folding of newly synthesized proteins is a key process, since the protein conformation 
is crucial for it to accomplish its function. This process can take several minutes and the fail- 
ures of folding are not negligible: in fact, about 30% of proteins in eukaryotes might undergo 
degradation, possibly for failures in the folding process or of multimer assembly. However, it is 
difficult to identify anomalous proteins, because of the short lifetime of some regulatory proteins. 
Degradation has been hypothesize to have also a control mechanism, in order to assure that 
multimeric complexes are present in the proper stoichiometry. In fact, it has been observed that 
free ribosomal subunit are really unstable, unless they are assembled in ribosomes. 

Mature proteins are exposed to the risk of denaturation or chemical damage, which can 
be caused by different means in a cell, like high temperature, high salt concentrations, other 
unfolded proteins, etc. Most part of proteins show long lifetimes: in mammals about 1 — 2% of 
these proteins are degraded each hour. 

The ubiquitin-proteasome pathway is one of the most important selective degradation ma- 
chinery for abnormal proteins in eukaryotes. Ubiquitin is a tiny protein that have earned the 
nickname of “the kiss of death”, since it tags the protein to be degraded, which occurs by means 
of the 26S proteasome. All this machinery requires ATP, which is used to activate ubiquitin. It is 
still unclear whether the ubiquitin tag is essential for the proteasome degradation for all abnor- 
mal proteins, since this process occurs efficiently in bacteria, which lack ubiquitin and eukaryotic 
26S proteasomes are able degrade unfolded proteins in vitro, without the ubiquitin bond. 

In eukaryotes the 265 proteasome is the only enzyme used to degrade various types of sub- 
strate. This enormous protein contains a central cavity where the lysis occurs, ensuring protein 
digestion to be isolated from the cytosol and other cellular components. The entry channel of 
the 26s is narrow, in order to exclude normal globular cells to be degraded. Moreover an ATP- 
dependent mechanism is used to recognize and linearize proteins inside the enzyme in order to 
enter in the 20S subunit and be degraded. Proteolysis in prokaryotes involves several proteases. 
Nevertheless the proteasome of Archea works together other cellular compounds which select, 
unfold and translocate abnormal proteins. We may think that the case of eukaryotes is the result 
emerged by evolution process. 


Stable and unstable proteins 


Proteins are known to degrade rapidly when their conformations are altered by mutations or 
denaturation, ...since all these modifications are likely to hinder folding or to disrupt tertiary 
structures, making these proteins predisposed to be degraded. 
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However, normal cells could show a high variability in their half-lives and several hypothesis 
have been proposed, see [? | for further details. In particular, the studies of stable and unsta- 
ble proteins has led to find common characteristics or sub-structures that are supposed to be 
responsible of the instability of the associated polypeptides. In the manuscript we refer to stable 
proteins those proteins whose lifetime is comparable (or bigger) to cell doubling time, while we 
refer to unstable proteins to those who are highly likely to be degraded rapidly. 


B.2 Biological glossary 


Here we give short descriptions of few biological components and mechanisms whose we refer to 
during the manuscript. Our aim is not to be exhaustive but to facilitate the reading. 


B.2.1 16S ribosomal RNA 


16S ribosomal RNA (16S rRNA) is a component of the 30S subunit of prokaryotic ribosomes. 
Multiple sequences of 16S rRNA can exist within a single bacterium. The 16S rRNA has several 
functions 


e it has a structural role, acting as a scaffold defining the positions of the ribosomal protein; 


the 3’ end contains the anti-Shine-Dalgarno sequence (see |B.2.14), which binds upstream 
to the AUG start codon on the mRNA; 


it interacts with 23S, aiding in the binding of the two ribosomal subunits (50S and 30S); 


e it stabilizes the correct codon-anticodon pairing in the A-site. 


B.2.2 (6-galactosidase 


B-galactosidase, or beta-gal or B-gal, is the enzyme resulting from the expression of lacZ gene, 
which is under the control of the lac promoter. The (-galactosidase serves as hydrolase enzime 
and catalyses the hydrolisis of B-galactosides into monosaccharides. Moreover, B-gal converts 
lactose into allolactose, by transglycosylation. The allolactose is an inducer of the lac operon in 
E. Coli. 

B-galactosidase has been the standard reporter for gene expression in both prokaryotes and 
eukaryotes [5], [8], [50]. A single molecule of beta-gal can produce a large number of fluorescent 
product molecules, leading to amplification of the fluorescent signal. 


B.2.3 DNA 


The deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions used in 
development and functioning of all known living organisms, with except of some viruses. The 
DNA segments that carry this genetic information are called genes, but other DNA sequences 
have structural or regulatory purposes. 

It consists of two long polymers of nucleotides, with backbones made of sugars and phosphate 
groups joined by ester bonds. Attached to each sugar is one of four types of molecules, called 
bases: the sequence of these bases encodes the DNA information. 

DNA is organized into long structures called chromosomes. The chromosomes are duplicated 
before cells divide in a process called DNA replication. Prokaryotes organisms store their DNA 
in the cytoplasm, while in eukaryotic it is stored in cell nucleus and in organelles. 
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DNA usually exists as a pair of molecules that are held tightly together and have the shape of 
a double helix. The nucleotide repeats contain both the segment of the backbone of the molecule 
and a base which interacts with the other DNA strand in the helix. A base linked to a sugar is 
called nucleoside, while a base linked to a sugar and one, or more, phosphate group is called a 
nucleotide. 

The backbone of the DNA is made of alternating 
phosphate and sugar (2-deoxyribose i.e. a 5 carbon 
sugar) residues. The sugars are joined together by phos- 
phate groups that form phosphodiester bonds between 
the third and fifth carbon atoms of adjacent sugar rings. 
These asymmetric bonds give to the DNA strand a “di- 
rection”. In the double helix the direction of nucleotides 
in one strand is opposite to their direction in the other 
strand i.e. the strands are antiparallel. The asymmetric 
ends of DNA strands are called 5’ “five prime”, which 
has a terminal phosphate group, and 3’ “three prime”, 
which has a terminal hydroxyl group. 

The four bases found in DNA are: adenine (A), cy- 
tosine (C), guanine (G) and thymine (T). These bases 
are attached to the sugar/phosphate to form a complete 


nucleotide. 
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B.2.4 Gene 
Figure B.2: The 5’ end designates the 


The gene is a molecular unit of heredity of a living or- end of DNA or RNA strand that has 
ganism, each gene corresponding to various biological the fifth carbon of the sugar-ring of the 
traits. More specifically it has been defined by Pearson deoxyribose or ribose at its terminus. 
et al. [56] as follows The 3’ end of a strand is so named due 

to it terminating at the hydroxyl group 


A gene is a locatable region of genomic se- of the third carbon in sugar-ring. 


quence, corresponding to a unit of inheri- 
tance, which is associated with regulatory 
regions, transcribed regions, and or other 
functional sequence regions. 


A constitutive gene is a gene that is transcribed con- 
tinually, i.e. there is no control over its expression. A housekeeping gene is typically a constitutive 
gene that is transcribed at a relatively constant level. The housekeeping gene’s products are typi- 
cally needed for maintenance of the cell. It is generally assumed that their expression is unaffected 
by experimental conditions. 

A facultative gene, as opposite to a constitutive gene, is transcribed only when needed. In 
particular, an inducible gene is a gene whose expression is either responsive to environmental 
change or dependent on the position in the cell cycle. 


B.2.5 Inducer 


Different proteins can affect gene expression by promoting or preventing mRNA transcription. 
In prokaryotes these cells often intervene at the beginning of a specific gene, called operator, 
which is the DNA sequence to which polymerases attach to synthesize mRNAs. 
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Inducers cause gene transcription to start, by disabling repressor proteins. More in detail 
inducers bind to repressors, change their shape and prevent them to bind to DNA, allowing gene 
transcription. Inducers can be sometimes modulated by activators which bind to inducers and 
allow them to attach to the DNA strand. 


B.2.6 Operon 


An operon is a functioning unit of gnomic DNA containing a cluster of genes under the control 
of a single promoter (see Section for details). This concept was introduced by Monod, 
Jacob et alii in 1960 [31]. The authors proposed a model in which genes are controlled via the 
operons through a single feedback regulatory mechanism: repression. It was then found that the 
regulatory mechanisms may be much more complicate. Nevertheless, this concept has revealed 
fundamental for molecular biology. 

The operon is composed of three basic structures: 


1. the promoter, see Section for details, to which the RNA polymerase binds to initiate 
transcription, 


2. the operator, which is a small sequence between the promoter and the genes, to which 
regulators bind. 


3. the structural genes, which are the genes that are regulated by the operon. 


Operons occur in both prokaryotes and eukaryotes. The genes contained in the operon are 
transcribed together into a messenger. This mRNA can be translated or cut into several mRNAs, 
each corresponding to a specific protein. The result is that genes contained in the same operon 
are expressed or repressed all together. 


B.2.7 Promoter 


In order for the transcription to take place, the enzyme that synthesizes RNA, known as RNA 
polymerase, must attach to the DNA near a gene. Promoters contain specific DNA sequences 
and response elements which provide a secure initial binding site for RNA polymerase and for 
proteins called transcription factors that recruit RNA polymerase. 

In bacteria, the promoter is recognized by RNA polymerase and an associated sigma factor 
o, which in turn is often brought to the promoter DNA by an activator protein binding to its 
own DNA binding site nearby. 


B.2.8 Ribosomal Binding Site (RBS) 


A ribosomal binding site (RBS) is a sequence on mRNA that is bound by the ribosome when 
initiating protein translation. The sequence is complementary to the 3’ end of the rRNA. The 
ribosome searches for this site and binds to it through base-pairing of nucleotides and then begins 
the translation process. 


B.2.9 Ribosome 


The ribosome is a large and complex molecular machine, that serves as the primary site of 
biological protein synthesis (translation). More in detail, ribosomes link together the amino 
acids as specified by the nucleotide sequence of mRNAs. 
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Ribosomes consist of two subunits: the large subunit (LSU) and small subunit (SSU). 
Prokaryotic ribosomes are around 20nm in diameter and are composed of 65% ribosomal RNA 
and 35% ribosomal proteins. For prokaryotes the LSU is called 50S and the SSU is called 308, 
while the ribosome is referred to as 70S. Note that the S units of the subunits cannot simply be 
added because they represent measures of sedimentation rate, that is affected by the component 
shape and by its mass. 

The 30S ribosomal subunit contains the 16S rRNA, see while the 50S contains both 
the 5S and 23S rRNA. The 3’ end of the 16S rRNA (in a ribosome) binds to a sequence on the 
5’ end of mRNA called the Shine-Dalgarno sequence (see Section [B.2.14). 


B.2.10 RNA 


Ribonucleic acid or RNA is a nucleic acid polymer consisting of nucleotide monomers and per- 
form multiple vital roles in the coding, decoding, regulation and expression of genes. RNA 
comprises the nucleic acids and, together with DNA and proteins, constitute the three major 
macromolecules essential for all cells. RNA acts as a messenger (mRNA) between DNA and the 
protein synthesis complexes (ribosomes), forms portions of ribosomes and acts as an essential 
carrier molecule for amino acids to be used in protein synthesis (tRNA). RNA is assembled as a 
chain of nucleotides but, unlike the DNA, it is single-stranded. 


B.2.11 Messenger RNA (mRNA) 


The messenger RNA (mRNA) is a chemical “blueprint” for a specific protein. More specifically, a 
messenger conveys the genetic information from the DNA to the ribosome, specifying the amino 
acid sequence of the protein products. 

The genetic information in mRNA is encoded in the sequence of nucleotides, which are ar- 
ranged into codons, a triplet of nucleotides. Each codon encodes for a specific amino acid, except 
the stop codons which terminate the protein synthesis. 

Because eukaryotic transcription and translation occur in different compartments of the cell, 
eukaryotic messengers must be exported from the nucleus to the cytoplasm and require a post- 
transcriptional processing before translation can take place. On the other hand, no processing is 
required for prokaryotic messengers and translation can initiate even before the mRNA elongation 
has been completed. 

Just before the AUG start codon, prokaryotic mRNAs have a common sequence called the 
Shine-Dalgarno sequence, see The Shine-Dalgarno sequences are complementary to a 
sequence contained in the 16S ribosomal RNA (rRNA), see and help align the mRNA 
with the ribosome to properly orient the molecules for translation initiation. 


B.2.12 Ribosomal RNA (rRNA) 


Ribosomal ribonucleic acid (rRNA) is the RNA component of the ribosome, the protein manufac- 
turing machinery of all living cells. Ribosomal RNA provides a mechanism for decoding mRNA 
into amino acids and interacts with tRNAs during translation by providing peptidyl transferase 
activity. The tRNAs bring the necessary amino acids corresponding to the appropriate mRNA 
codon. 


B.2.13 Transfer RNA (tRNA) 


A transfer RNA (tRNA) is a type of RNA molecule that helps decode a messenger RNA (mRNA) 
sequence into a protein. More specifically, each amino acid is specified by a three-nucleotide 
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mRNA sequence (codon) and each codon is recognized by a specific tRNA, which contains a 
specific anticodon triplet sequence. When the tRNA is associated with the corresponding amino 
acid, we refer to it as aminoacyl-tRNA or charged tRNA. The charged tRNA delivers the amino 
acid to the ribosome for incorporation into the polypeptide chain that is being produced. The 
uncharged tRNA is then released and ribosome proceeds to the next codon and continues trans- 
lation step. 


B.2.14 Shine-Dalgarno sequence 


The Shine-Dalgarno sequence (SD), AGGAGG, is a ribosomal binding site in the mRNA, gener- 
ally located 8 basepairs upstream of the start codon AUG and exists only in prokaryotes. This 
sequence helps recruit the ribosome to the mRNA to initiate protein synthesis by aligning it with 
the start codon. The complementary sequence (UCCUCC), is called the anti-Shine-Dalgarno se- 
quence and is located at the 3’ end of the 165 rRNA in the ribosome. 

Mutations in the Shine-Dalgarno sequence can reduce translation, due to a reduced mRNA- 
ribosome pairing efficiency. Anyway, what principally attracts ribosome to mRNA initiation 
region is apparently ribosomal protein S1, which binds to AU-rich sequences found in many 
prokaryotic mRNAs 15-30 nucleotides upstream of start-codon. 
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